dou4064 2015-11-27 14:18

XPath until the next tag

This is similar to questions asked here before, but I cannot figure out how to apply those suggestions, so I need some help.

I'd like to find the nodes of an HTML document that has a structure like this (an extract; the content can vary):

<h2>My title 1</h2>
<h3>Sub-heading</h3>
<p>...<span><a href='#'>...</a></span></p>
<div>...</div>
<h2>My title 2</h2>
<p>No sub-heading here :O</p>
<h3>But here</h3>
<p>No link</p>
<h2>And so on...</h2>
<p>...</p>

What I'd like to accomplish is to find all nodes from one h2 up to the last item before the next h2, including the h2 itself. For my example, I'd like to retrieve "blocks" like these:

Block 1:

<h2>My title 1</h2>
<h3>Sub-heading</h3>
<p>...<span><a href='#'>...</a></span></p>
<div>...</div>

Block 2:

<h2>My title 2</h2>
<p>No sub-heading here :O</p>
<h3>But here</h3>
<p>No link</p>

Block 3:

<h2>And so on...</h2>
<p>...</p>

I have nothing else to target (no id, no text content I could know about, no guaranteed content, etc.), apart from the h2 elements.

1 answer

  • dongzhou5344 2015-11-27 17:03

    You can use DOMXPath and its query method.

    First, find all the h2 elements that are direct children of the body (not nested h2 elements).

    Then loop over every h2 found. Add that h2 to an array $set, because you want to keep it. Then walk its following siblings and add each one to $set until you reach the next h2.

    Finally, add $set to the $sets array.

    For example:

    $html = <<<HTML
    <h2>My title 1</h2>
    <h3>Sub-heading</h3>
    <p>...<span><a href='#'>...</a></span></p>
    <div>...</div>
    <h2>My title 2</h2>
    <p>No sub-heading here :O</p>
    <h3>But here</h3>
    <p>No link</p>
    <h2>And so on...</h2>
    <p>...</p>
    <div><h2>This is nested</h2></div>
    HTML;
    
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // Only h2 elements that are direct children of <body>
    $domNodeList = $xpath->query('/html/body/h2');
    
    $sets = array();
    
    foreach ($domNodeList as $h2) {
        // Start the block with the h2 itself
        $set = array($h2);
    
        // Walk the following siblings until the next h2
        for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === "h2") {
                break;
            }
            // Keep element nodes only; skip whitespace text nodes and comments
            if ($node->nodeType === XML_ELEMENT_NODE) {
                $set[] = $node;
            }
        }
    
        $sets[] = $set;
    }
    

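    $sets will now contain 3 arrays, each holding the DOMElement nodes of one block. The nested h2 does not start a block of its own; its parent div is simply part of the third block.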
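
If you need each block as markup rather than as DOM nodes, the same sibling walk can build strings directly. The following is a minimal sketch, not part of the original answer: the function name collectBlocks and the short sample input are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns one HTML string per block, where a block
// is a top-level <h2> plus the element siblings up to the next <h2>.
function collectBlocks(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $blocks = [];
    foreach ($xpath->query('/html/body/h2') as $h2) {
        // Serialize the h2 itself first
        $markup = $doc->saveHTML($h2);

        // Append each following element sibling until the next h2
        for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'h2') {
                break;
            }
            if ($node->nodeType === XML_ELEMENT_NODE) {
                $markup .= $doc->saveHTML($node);
            }
        }
        $blocks[] = $markup;
    }
    return $blocks;
}

// Sample input (an assumption, shortened from the question's structure)
$sample = "<h2>My title 1</h2><h3>Sub-heading</h3><p>text</p>"
        . "<h2>My title 2</h2><p>No sub-heading here</p>";

foreach (collectBlocks($sample) as $i => $block) {
    echo "Block ", $i + 1, ":", PHP_EOL, $block, PHP_EOL;
}
```

Passing a node to saveHTML serializes only that node and its descendants, so nested markup such as links inside a paragraph is preserved within its block.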

    Demo with var_dump of $sets
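
Since the question asks about XPath specifically, here is a hedged alternative that selects one block with a single XPath 1.0 expression instead of a sibling loop. It relies on the fact that every element of the k-th block has exactly k preceding h2 siblings, while the k-th h2 itself has k - 1. The function name blockByXPath and the sample input are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the nodes of the k-th block (1-based),
// i.e. the k-th top-level <h2> and the elements up to the next <h2>.
function blockByXPath(DOMXPath $xpath, int $k): DOMNodeList
{
    $prev = $k - 1;
    // Match the k-th h2 (it has k-1 h2 siblings before it), or any
    // non-h2 element that has exactly k h2 siblings before it.
    $expr = "/html/body/*["
          . "(self::h2 and count(preceding-sibling::h2) = $prev)"
          . " or (not(self::h2) and count(preceding-sibling::h2) = $k)"
          . "]";
    return $xpath->query($expr);
}

// Sample input (an assumption, shortened from the question's structure)
$doc = new DOMDocument();
$doc->loadHTML("<h2>A</h2><p>one</p><div>x</div><h2>B</h2><p>two</p>");
$xpath = new DOMXPath($doc);

// evaluate() returns a float for count(), so cast it to int
$total = (int) $xpath->evaluate('count(/html/body/h2)');

for ($k = 1; $k <= $total; $k++) {
    $names = [];
    foreach (blockByXPath($xpath, $k) as $node) {
        $names[] = $node->nodeName;
    }
    echo "Block $k: ", implode(', ', $names), PHP_EOL;
}
```

This version re-counts preceding h2 siblings for every candidate node, so the sibling loop above is likely the better choice for large documents; the XPath form is mainly useful when a single expression is required.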

    This answer was accepted by the asker as the best answer.
