Alternative to a regular expression for getting the content of XML tags

I'm processing an XML file and I need to get all the content inside <section> tags.

Right now I'm using this regex:

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $results);?>

The code inside the <section> tags is pretty complex. It includes math equations and things like that. On my local machine the regex works perfectly. It runs PHP 5.3.10 on Apache 2.2.22 (Ubuntu).

BUT on my staging server it doesn't work. It runs PHP 5.3.3 on Apache 2.2.15 (Red Hat).

I have two questions:

Is there any known issue with preg_match_all in PHP 5.3.3?

Is there a better way to express the regex?

--EDIT: VARIATIONS OF REGEX USED UNSUCCESSFULLY--

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);?>
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>(.*?)<\/section>#ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>([^\00]*?)<\/section>#ims', $myXmlString, $results);?>

--EDIT: Why haven't I used a parser?

The XML consists of two <section> elements. Each section groups n questions for an exam.

Each question can include math equations represented by its own XML. An equation may be something like this:

<inlineequation><m:math baseline="-16.5" display="inline" overflow="scroll"><m:mrow><m:mtable columnalign="left"><m:mtr><m:mtd><m:mrow><m:mo stretchy="true">[</m:mo><m:mrow><m:mtable columnalign="right"><m:mtr><m:mtd><m:mn>4</m:mn></m:mtd><m:mtd columnalign="right"><m:mrow><m:mo>-</m:mo><m:mn>9</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mrow><m:mn>54</m:mn></m:mrow></m:mtd></m:mtr><m:mtr><m:mtd columnalign="right"><m:mrow><m:mo>&minus;</m:mo><m:mn>28</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mo>&minus;</m:mo><m:mn>1</m:mn></m:mtd><m:mtd columnalign="right"><m:mo>&minus;</m:mo><m:mn>14</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo stretchy="true">]</m:mo></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow></m:math></inlineequation>

I need that code to remain XML (no array) because I will pass that code as it is to a jQuery plugin which will render the equation (it will look like LaTeX equations).

If I parse the XML it will be really difficult to create the string for the equation again and locate it in the right place inside the question's statement.

dongmei8071: These things are harder than they look at first, second, or even third glance.
more than 6 years ago
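dongpao5658: It's failing on PHP 5.3.3, not 5.3.6. My first approach was to use a parser, but there is a lot of code inside the sections that needs to remain XML, because it will be interpreted by a jQuery plugin to render the math equations.
more than 6 years ago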
duanpan7011: Also, are you even bothering to read the documentation? You seem to have missed a specific point about PHP 5.3.6.
more than 6 years ago
duanbenzan4050: The code at hand doesn't work on either version, because of unescaped delimiters.
more than 6 years ago
dongtang3155: Why not use an XML parser? Parsing XML with regex has some problems, like, sanity.
more than 6 years ago

2 answers

Regex can be resource intensive.

Perhaps consider using xml_parse_into_struct:

<?php
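    // Create an expat-based parser resource.
    $xmlp = xml_parser_create();
    // Fill $vals with one entry per tag/text node and $index with tag positions.
    xml_parse_into_struct($xmlp, $myXmlString, $vals, $index);
    // Release the parser once the arrays are populated.
    xml_parser_free($xmlp);
    print_r($vals);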
?>
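
If the content has to stay XML rather than become an array, a DOM parser can serialize each child of a <section> back to markup. The following is a minimal sketch, not a drop-in fix: the sample document, the <exam> and <question> element names, and the namespace URI bound to the m: prefix are assumptions made for illustration. It also assumes the input is well-formed; an entity such as &minus; that is not declared in the document would likely make a strict parser reject it.

```php
<?php
// Hypothetical sample input; element names other than <section> are made up.
$myXmlString = '<exam>'
    . '<section><question>Solve <inlineequation>'
    . '<m:math xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mn>4</m:mn></m:math>'
    . '</inlineequation></question></section>'
    . '<section><question>Second</question></section>'
    . '</exam>';

$doc = new DOMDocument();
$doc->loadXML($myXmlString);

$sections = array();
foreach ($doc->getElementsByTagName('section') as $section) {
    $inner = '';
    // Serialize every child node back to an XML string instead of an array.
    foreach ($section->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $sections[] = $inner;
}
```

Each entry of $sections is then a plain string of markup that could be handed to the rendering plugin unchanged.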
duancoubeng5909: Thanks @flauntster. I edited the question to explain why I can't use a parser.
more than 6 years ago

As others have said, don't use regex to parse XML. Having said that, let's answer your actual question:

Is it at all likely that your XML document contains line breaks? Do you realise that the . character will match everything except line-breaks unless you explicitly turn this feature on?

Try this:

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);?>

The extra s at the end tells the regex engine to allow . to match line-breaks.
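
To see the difference the s modifier makes, here is a small sketch that runs both patterns against a hypothetical two-section string containing line breaks (the sample data is invented for illustration):

```php
<?php
// Hypothetical sample input: two sections, each spanning several lines.
$myXmlString = "<section id=\"a\">\n  <q>1</q>\n</section>\n"
    . "<section id=\"b\">\n  <q>2</q>\n</section>";

// Without the s modifier, "." stops at line breaks, so nothing matches here.
$withoutS = preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $m1);

// With the s modifier, "." also matches line breaks.
$withS = preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $m2);

// $m2[1] holds the captured inner content of each section.
```

Here $withoutS is 0 and $withS is 2, because each section's content begins with a line break that . refuses to cross unless s is set.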

Honestly though, a lot of people get too hung up on "not parsing XML with regex" without actually thinking about why it's a bad idea. Performance aside, it's essentially because there's no proper way of dealing with nested tags - there's more to it than that, but this is basically what it boils down to. XML documents are not regular so you can't use regular expressions to parse them.

HOWEVER! Sometimes the data that you want to get out of an XML document definitely IS regular. If you throw away the fact that you're dealing with XML for a moment and treat it as just a string of text - you can establish definite patterns that you ABSOLUTELY can use regex to pull out.

In your case, I'd say it's a safe bet that your XML document has a flat structure at the <section> level; there wouldn't be <section> tags nested inside other <section> tags, for example. In that case, if we forget the XML component and just think about the patterns you've got:

  • Unmatched text
  • Pattern that denotes the start of a match
  • Matched text
  • Pattern that denotes the end of a match
  • Unmatched text
  • etc ...

This is absolutely regular and - save for some insane edge cases I wouldn't bother worrying about - it's pretty damned safe!
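
Because the pattern is this simple, the same start/end scan can also be written with plain string functions, which sidesteps the regex engine entirely. The sketch below is one way to do it; extractSections is a hypothetical helper name, and it assumes that sections are never nested inside one another and that no attribute value contains a > character. Like the regex, it would also accept any tag whose name merely begins with "section".

```php
<?php
/**
 * Hypothetical helper: collect the text between each <section ...> start tag
 * and the next </section>, using string functions only (case-insensitive).
 */
function extractSections($xml)
{
    $results = array();
    $offset = 0;
    while (($start = stripos($xml, '<section', $offset)) !== false) {
        // Find the end of the opening tag, skipping over any attributes.
        $open = strpos($xml, '>', $start);
        if ($open === false) {
            break;
        }
        // Find the closing tag that ends this match.
        $end = stripos($xml, '</section>', $open);
        if ($end === false) {
            break;
        }
        // Keep everything between the two tags, untouched.
        $results[] = substr($xml, $open + 1, $end - $open - 1);
        // Continue scanning after the closing tag.
        $offset = $end + strlen('</section>');
    }
    return $results;
}

$sections = extractSections("<section id=\"a\">\n<q>1</q>\n</section>");
```

The captured strings are returned exactly as they appear in the input, so any embedded equation markup stays intact.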

dsfhe34889789708: Thanks. I edited the question to include the regex variations I have already tried and why I need to use a regular expression instead of a parser.
more than 6 years ago