doulai2573 2014-07-22 09:22
Views: 13
Accepted

too long

I'm currently working on a way to parse an HTML document into a database. I'm not allowed to change any of the formatting in the HTML document. In the following example I need to find the tags that have the class "Category" and then grab the data inside each such tag, in this example "Example Text".

How do I get the regex to match more than just the tags that are closed immediately after they are opened?

$tags = "<p class=Category style='margin-left:0in;text-indent:0in'><a name='_
Toc390163149'></a><a name='_Ref388370252'></a><a
name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt 'Times New 
Roman''>&nbsp;</span></span><span lang=EN-GB>Example Text</span></a></p>";

// PREG_SET_ORDER groups the results per match, so $val[0]..$val[4] belong together.
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $tags, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
    echo "matched: " . htmlspecialchars($val[0]) . "<br>";
    echo "part 1: " . htmlspecialchars($val[1]) . "<br>";
    echo "part 2: " . htmlspecialchars($val[2]) . "<br>";
    echo "part 3: " . htmlspecialchars($val[3]) . "<br>";
    echo "part 4: " . htmlspecialchars($val[4]) . "<br><br>";
}

Outputs:

matched: <a name="_Toc390163149"></a>
part 1: <a name="_Toc390163149">
part 2: a
part 3:
part 4: </a>

matched: <a name="_Ref388370252"></a>
part 1: <a name="_Ref388370252">
part 2: a
part 3:
part 4: </a>

matched: <span lang=EN-GB>When not to follow Rules</span>
part 1: <span lang=EN-GB>
part 2: span
part 3: When not to follow Rules
part 4: </span>

Any ideas?

1 answer

  • dongwenhui8900 2014-07-22 09:34

    Short answer: you can't reliably parse a complicated data format such as HTML with a regex, or at least you shouldn't. A regex has no notion of nesting, so the lazy `(.*?)` in your pattern stops at the first closing tag with the same name, and only the innermost or empty elements come out whole.

    Long answer: PHP ships with extensions for parsing HTML that take far less effort and are far less error-prone than a regex solution. The two of interest are SimpleXML (if you're parsing XHTML) and DOMDocument (if you're parsing markup that may or may not be well-formed XML). I'd be inclined to use the latter for HTML.

    Once you've loaded the markup into a DOMDocument, you can use an XPath query to locate all the p elements whose class is "Category" (the match is case-sensitive) and iterate over them to get their child nodes and content.
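
    A minimal sketch of that approach, applied to the markup from the question. The function name extractCategoryTexts is made up for illustration, and two things are assumptions rather than facts from the question: that the text you want sits in the last span inside a link, and that the inner font quotes were originally double quotes (the line wraps in the sample have also been removed).

```php
<?php
// Hypothetical helper: returns the text of the last <span> inside each
// <a> of every <p> whose class attribute contains the token "Category".
function extractCategoryTexts(string $html): array
{
    // Collect parser warnings instead of printing them; exported HTML is rarely clean.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match "Category" as a whole class token, even if other classes are present.
    $paragraphs = $xpath->query(
        "//p[contains(concat(' ', normalize-space(@class), ' '), ' Category ')]"
    );

    $texts = [];
    foreach ($paragraphs as $p) {
        // Assumption: the wanted text is in the last <span> child of an <a>.
        foreach ($xpath->query('.//a/span[last()]', $p) as $span) {
            $texts[] = trim($span->textContent);
        }
    }

    return $texts;
}

// Sample markup from the question, with the inner font quotes written as
// double quotes (an assumption about the original document).
$tags = "<p class=Category style='margin-left:0in;text-indent:0in'>"
      . "<a name='_Toc390163149'></a><a name='_Ref388370252'></a>"
      . "<a name='_Toc122858606'><span lang=EN-GB>3."
      . "<span style='font:7.0pt \"Times New Roman\"'>&nbsp;</span></span>"
      . "<span lang=EN-GB>Example Text</span></a></p>";

print_r(extractCategoryTexts($tags));
```

    If you need the whole heading including the "3." numbering, read $p->textContent instead of querying for the span. The document itself is never modified, so the formatting constraint from the question still holds.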

    This answer was selected as the best answer by the asker.