douqian6194 2011-06-10 15:20

Parsing HTML: are regular expressions the only option in this case?

A user will supply HTML, which may be valid or invalid (malformed). I need to be able to determine such things as:

  1. Is there a style tag in the body?
  2. Is there a div whose style attribute makes use of width or background-image?

I have tried using the DOMDocument class, but with XPath I could only get it to do 1 and not 2.

I have also tried simple_html_dom, and that can only do 1 but not 2.

Do you think it's a good idea to just use regular expressions, or is there something that I haven't thought of?


4 answers

  • doushi4864 2011-06-10 16:38
    Accepted answer

    XPath can do both (1) and (2):

    To test if there's a style tag in the body:

    //body//style
    

    To test if there's a div with a style attribute using width or background-image:

    //div[contains(@style,'width:') or contains(@style,'background-image:')]
    

    And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:

    //style[contains(text(),'a:hover') or contains(text(),'font-size:')]
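
    For reference, the three queries above can be run from PHP roughly as follows. This is a minimal sketch, assuming the built-in DOMDocument and DOMXPath classes; the helper name htmlMatches and the sample HTML are made up for illustration.

    ```php
    <?php
    // Hypothetical helper: runs an XPath query against possibly malformed HTML
    // and reports whether at least one node matched.
    function htmlMatches(string $html, string $query): bool
    {
        $doc = new DOMDocument();
        // Collect parser warnings about malformed markup instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $nodes = $xpath->query($query);
        return $nodes !== false && $nodes->length > 0;
    }

    // Made-up sample input; note the unclosed <div>.
    $html = '<html><body><style>a:hover { color: red; }</style>'
          . '<div style="width: 100px">text</body></html>';

    $hasStyleInBody = htmlMatches($html, '//body//style');
    $hasStyledDiv = htmlMatches(
        $html,
        "//div[contains(@style,'width:') or contains(@style,'background-image:')]"
    );
    $hasHoverRule = htmlMatches(
        $html,
        "//style[contains(text(),'a:hover') or contains(text(),'font-size:')]"
    );
    ```

    Keep in mind that contains() is a plain substring test: 'width:' also matches max-width:, and it does not match a declaration written as "width :" with a space before the colon.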
    
  • duanchuopi2298 2011-06-10 15:24

    Regex is NEVER (again: NEVER!) a solution for parsing HTML!

    Regex can be used for Type-3 Chomsky languages (regular language).
    HTML however is a Type-2 Chomsky language (context-free language).

    If still in doubt: http://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

    To work safely with a type-2 language you need a context-free language parser. You might want to try an LL parser or a recursive descent parser, e.g.


    That being said:

    Match a body tag with a style attribute:

    <body\s+[^>]*style\s*=\s*["'].*?[^"']*?["'][^>]*>
    

    Match div with width|background-image in style:

    <div\s+[^>]*style\s*=\s*["'][^"']*?(width|background-image)[^"']*?["'][^>]*>
    

    Both patterns also match tags that have been commented out, which is a false positive (and is why I said it's not possible).
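
    The false positive can be reproduced with a short sketch. It assumes PHP's preg_match and the built-in DOMDocument class; the input string is made up, and the DOM query is only there for comparison.

    ```php
    <?php
    // The div pattern from above, wrapped in ~ delimiters for preg_match.
    $divPattern = '~<div\s+[^>]*style\s*=\s*["\'][^"\']*?(width|background-image)[^"\']*?["\'][^>]*>~';

    // Made-up input: the only styled div sits inside an HTML comment.
    $commentedOut = '<body><!-- <div style="width: 10px"></div> --></body>';

    // The regex only sees characters, so it matches inside the comment.
    $regexMatches = preg_match($divPattern, $commentedOut) === 1;

    // A DOM parser turns the comment into a comment node, so no div element exists.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($commentedOut);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $divs = (new DOMXPath($doc))->query('//div[contains(@style,"width")]');
    $domMatches = $divs !== false && $divs->length > 0;
    ```

    Here $regexMatches and $domMatches disagree, and only the parser's answer reflects what a browser would render.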

  • douhuike3199 2011-06-10 15:27

    You can use Tidy to clean up the HTML, then parse it as XML. Then it's easy to use xpath to find nodes. Try something like this:

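    // Ask Tidy for well-formed XML so that SimpleXMLElement can load the result.
    $tidyConfig = array(
        "add-xml-decl" => true,
        "output-xml" => true,
        "numeric-entities" => true
    );
    $tidy = new tidy();
    $tidy->parseString($html, $tidyConfig, "utf8");
    $tidy->cleanRepair();
    // Cast to string to get the repaired markup.
    $xml = new SimpleXMLElement((string) $tidy);
    // '//style' searches the whole document; a bare 'style' would only
    // look at direct children of the root element.
    $matches = $xml->xpath('//style');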
    

    As for parsing a style attribute to look for specific properties, I think you'll have to do that manually. You could use a CSS parser if you want.
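
    Doing it manually could look something like the sketch below. The helper name parseInlineStyle and the example.com URL are made up, and the split on ';' and ':' is deliberately naive, so treat it as a starting point and not as a CSS parser.

    ```php
    <?php
    // Hypothetical helper: splits an inline style attribute into a
    // property => value map. Values that themselves contain ';'
    // (for example data: URLs) will be cut short by this naive split.
    function parseInlineStyle(string $style): array
    {
        $properties = [];
        foreach (explode(';', $style) as $declaration) {
            // Limit to 2 parts so values such as url(http://...) keep their colons.
            $parts = explode(':', $declaration, 2);
            if (count($parts) !== 2) {
                continue; // skip empty or malformed declarations
            }
            $name = strtolower(trim($parts[0]));
            if ($name !== '') {
                $properties[$name] = trim($parts[1]);
            }
        }
        return $properties;
    }

    $style = parseInlineStyle('Width: 100px; background-image: url(http://example.com/a.png);');
    $usesWidthOrBackground = isset($style['width']) || isset($style['background-image']);
    ```

    Because the lookup is by exact property name, max-width does not count as width here, unlike a substring test.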

  • douxuan1284 2011-06-10 15:33

    It's rarely a good idea to parse HTML with regex. However, any good HTML parser will be able to find all the divs with style attributes, and regex could be useful for parsing those attributes once you've done that.

    It's still possible for complex (yet valid) CSS to break most regex, however, so the really durable thing here would be an HTML parser combined with a CSS parser. That could be overkill, though; a regex like \bwidth\s*:\s*(\w+) is likely to catch any width value unless someone's actively trying to fool it.
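
    A rough sketch of that combination, assuming PHP's built-in DOMDocument as the HTML parser (any parser that tolerates malformed input would do). The helper name divWidths and the sample markup are made up.

    ```php
    <?php
    // Hypothetical helper: uses a DOM parser to find the divs, then the regex
    // above to read the width value from each inline style attribute.
    function divWidths(string $html): array
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // silence malformed-markup warnings
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $widths = [];
        foreach ($doc->getElementsByTagName('div') as $div) {
            // getAttribute() returns '' when the div has no style attribute.
            if (preg_match('/\bwidth\s*:\s*(\w+)/', $div->getAttribute('style'), $m) === 1) {
                $widths[] = $m[1];
            }
        }
        return $widths;
    }

    $widths = divWidths(
        '<div style="color: red; width : 100px">a</div><div style="color: blue">b</div>'
    );
    ```

    Two limits of the pattern are worth knowing: \b also holds after a hyphen, so max-width is matched as well, and (\w+) stops at the first non-word character, so a value like 50% is captured as 50.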

    Edit:

    A good HTML parser won't choke on anything that wouldn't choke a browser. I'm not a PHP guy anymore, but I've heard some good things about HTML Purifier.
