2014-05-23 16:46 · 44 views

preg_replace vs DOMDocument replaceChild

I was wondering which of the two methods in the title is more efficient for replacing content in an HTML page.

I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.

There are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild. For instance, when I forget the slash in the tag, like so: <includes module='footer'>, the whole site crashes.

Regex tolerates variations like these, as long as the input matches the pattern. It would even allow me to replace an arbitrary string, like {includes:footer}.

Now back to my actual question: are there any downsides to using regex for this purpose, such as performance issues?

More here: Append child/element in head using XML Manipulation


2 answers

  • Accepted
    dongshuobei1037 2014-05-23 17:42

    I wouldn't be too worried about performance here; I would consider the two "comparable". Benchmarks would need to be run to truly determine this, as the result depends on the size of the document and on how the regular expression is written.

    Instead, I would be concerned about accuracy. In general, DOMDocument will be much better at parsing XML, since it was built to read and understand the language. However, it does fail on <includes module='footer'> because that is an unclosed tag (the parser expects a matching </includes>).
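
    As a minimal sketch of that failure mode, assuming the page is loaded with loadXML (the question does not say which load method is used), the check below reports whether a string parses as well-formed XML. The helper name isWellFormedXml and the <page> wrapper are hypothetical.

```php
<?php
// Hypothetical helper: returns true if $xml parses as well-formed XML.
function isWellFormedXml(string $xml): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$closed   = "<page><includes module='footer'/></page>";
$unclosed = "<page><includes module='footer'></page>";

var_dump(isWellFormedXml($closed));   // bool(true): self-closing tag
var_dump(isWellFormedXml($unclosed)); // bool(false): missing slash
```

    Checking the return value of the load call before touching the nodes lets the page fall back gracefully instead of crashing on a null item(0).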

    Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check it out, since you should get much more predictable results than you would by parsing with regex. With a regular expression, there could technically be attributes before or after module, elements nested inside the includes element, unexpected characters as in <includes module='foo>bar'>, and so on.

    In the end, if your XML lives in a "controlled" environment (i.e. you know what can and can't happen, you know which characters module may contain, you know that it will always be a self-closing element with no children, etc.), then by all means use a regular expression. Just know that it matches only a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
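
    For the controlled case, a rough sketch could look like the following. It assumes the tag carries a single module attribute whose value consists only of letters, digits, underscores, or hyphens; the $modules array and the replaceIncludes helper are hypothetical names for illustration.

```php
<?php
// Hypothetical module contents, keyed by module name.
$modules = ['footer' => '<footer>Footer content</footer>'];

// Matches <includes module='NAME'> with or without the closing slash.
// Assumes a single module attribute holding letters, digits, _ or -.
$pattern = '~<includes\s+module=([\'"])([\w-]+)\1\s*/?>~i';

function replaceIncludes(string $html, array $modules, string $pattern): string
{
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($modules): string {
            // $m[2] is the module name; unknown modules are left untouched.
            return $modules[$m[2]] ?? $m[0];
        },
        $html
    );
}

echo replaceIncludes("<body><includes module='footer'></body>", $modules, $pattern), "\n";
echo replaceIncludes("<body><includes module='footer'/></body>", $modules, $pattern), "\n";
```

    Both calls print <body><footer>Footer content</footer></body>. Anything outside the stated assumptions (extra attributes, child elements, a > inside the attribute value) simply does not match and is left as it was.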

    Also, as a final note: if you plan to find, replace, or manipulate many nodes in a document, you will see a large performance gain from a DOM parser. A DOM parser takes a document and parses it once; after that you just traverse the data it has already loaded. With regular expressions, by contrast, each individual pattern is run across the whole document looking for its matches.
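
    A small sketch of the parse-once idea, assuming well-formed input and plain-text replacements (real modules would more likely be markup imported as a fragment). The $modules array and the sample document are hypothetical.

```php
<?php
// Hypothetical module contents, keyed by module name.
$modules = [
    'header' => 'Header content',
    'footer' => 'Footer content',
];

$xml = "<page><includes module='header'/><main>Body</main><includes module='footer'/></page>";

$doc = new DOMDocument();
$doc->loadXML($xml); // the document is parsed once

// getElementsByTagName returns a live list, so copy it before changing the tree.
$nodes = iterator_to_array($doc->getElementsByTagName('includes'));

foreach ($nodes as $node) {
    $name = $node->getAttribute('module');
    if (!isset($modules[$name])) {
        continue; // leave unknown modules in place
    }
    // Swap the placeholder element for a text node holding the module content.
    $node->parentNode->replaceChild($doc->createTextNode($modules[$name]), $node);
}

// Serialize only the root element, which omits the XML declaration.
$result = $doc->saveXML($doc->documentElement);
echo $result, "\n"; // <page>Header content<main>Body</main>Footer content</page>
```

    Every placeholder is handled in one pass over the already-parsed tree, whereas a per-module regex would scan the full string once per module.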

    If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.

  • doumianfeng6979 2014-05-24 10:22

    So I did some naive performance testing using microtime(true), and it turns out that preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms! But I guess that holds only for my case.
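
    A naive timing harness along these lines might look like the sketch below. It is not the code used for the numbers above: the test page, the iteration count, and the averageMs helper are all assumptions, and the results will vary with document size and machine.

```php
<?php
// Hypothetical test page; a real benchmark should use the actual template.
$page = '<html><body>' . str_repeat('<p>text</p>', 1000)
      . "<includes module='footer'/></body></html>";
$replacement = 'Footer content';

// Runs $fn $iterations times and returns the average time per call in ms.
function averageMs(callable $fn, int $iterations = 200): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000 / $iterations;
}

$regexMs = averageMs(function () use ($page, $replacement) {
    return preg_replace('~<includes\s+module=\'footer\'\s*/?>~', $replacement, $page);
});

$domMs = averageMs(function () use ($page, $replacement) {
    // Parsing is included in the timing, since each request has to do it.
    $doc = new DOMDocument();
    $doc->loadXML($page);
    $node = $doc->getElementsByTagName('includes')->item(0);
    $node->parentNode->replaceChild($doc->createTextNode($replacement), $node);
    return $doc->saveXML($doc->documentElement);
});

printf("preg_replace: %.3f ms, DOM replaceChild: %.3f ms\n", $regexMs, $domMs);
```

    Averaging over many iterations smooths out the noise that a single microtime(true) pair is prone to.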

    This is what my HTML looks like:

    <!DOCTYPE html>
            a lot more here

    This is the regex I used: /{([ ]*)includes:([ ]*)$key([^}]*)}/i

    As I said, I'm not fully proficient with regex, but this did the job. I guess that if you optimized it, it would run even faster.

    For the replaceChild method I used a custom tag like this: <includes module='body'/>

    Again, this was tested on my local server, so I still need to run some tests to see how it behaves on my online server...
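
    One possible tightening of that pattern is sketched below; it is a variant, not the regex used in the test. It escapes the key with preg_quote and, unlike the ([^}]*) group above, allows only whitespace after the key, so {includes:footers} no longer matches the key footer. The replacePlaceholder helper is a hypothetical name.

```php
<?php
// Hypothetical helper: replaces {includes:KEY} placeholders with $content.
function replacePlaceholder(string $html, string $key, string $content): string
{
    // preg_quote escapes any regex metacharacters contained in the key.
    $pattern = '/\{\s*includes:\s*' . preg_quote($key, '/') . '\s*\}/i';

    // A callback keeps "$1" or "\1" inside $content from being treated
    // as backreferences, which a plain preg_replace replacement would do.
    return preg_replace_callback($pattern, function () use ($content) {
        return $content;
    }, $html);
}

$html = '<body>{includes:footer} { includes: footer }</body>';
echo replacePlaceholder($html, 'footer', '<footer>Footer content</footer>'), "\n";
```

    Both placeholders in the sample string are replaced, with the space between them preserved.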
