duankuai6586 2012-10-29 18:03

DOMDocument - extract a tag's textContent, but first remove certain child elements

Sample source HTML:

<p>
 <strong>Byline:</strong> Introductory text. 

 <a href="1.html" target="">Link 1</a> |
 <span class="foo"></span> 
 <a href="2.html">Link 2</a>
 <a href="3.html">Link 3</a>
</p>

What I'm trying to do:

I'd like to load the HTML in, get rid of the links and other extraneous tags (not a problem if I have to specify what they are), things like the '|' and so on, keeping the "Byline" and "Introductory text". This is a script that parses a 3rd-party site, so I've no ability to add CSS classes, etc.

I first attempted this with the PHP Simple HTML DOM Parser (not very widely used now), and more recently have been trying DOMDocument.

However I'm getting absolutely nowhere - e.g. right now I can't even traverse the tree underneath <p>:

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

$p = $doc->getElementsbyTagName('p');

foreach($p->childNodes as $item) {
  ...    
}

The above gives me an 'Undefined property: DOMNodeList::$childNodes' error for the foreach line.
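For reference, the error message itself points at the cause: getElementsByTagName() returns a DOMNodeList (a collection of matches), not a single element, so there is no childNodes property on it. PHP method names are case-insensitive, so the lowercase "b" in getElementsbyTagName is not the problem. A minimal sketch of the traversal, assuming a shortened version of the sample HTML stands in for $somehtml:

```php
<?php
// Shortened stand-in for the real page content (an assumption for this sketch).
$somehtml = '<p><strong>Byline:</strong> Introductory text. <a href="1.html">Link 1</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

// This is a DOMNodeList of every <p> in the document.
$paragraphs = $doc->getElementsByTagName('p');

$names = [];
foreach ($paragraphs as $p) {
    // Each $p is a DOMElement, which does have childNodes.
    foreach ($p->childNodes as $item) {
        $names[] = $item->nodeName;
    }
}

echo implode(', ', $names), PHP_EOL; // strong, #text, a
```

If only the first paragraph matters, $doc->getElementsByTagName('p')->item(0) gives that single element directly.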

Also: I'm finding it frustrating that I apparently can't visualise the DOM using print_r, var_dump, etc. I also tried looping through the links using xpath->query, which seems inappropriate here: I don't really want to search for or extract specific things, but rather take the HTML, get rid of the nodes I don't want and then save it. When I did, print_r showed me the link text but not the contents of href="".
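On the inspection point, one approach that may help is to serialise a node back to markup with saveHTML($node) rather than dumping the object, and to read attributes explicitly with getAttribute(). A small sketch, again assuming a shortened stand-in for the sample HTML:

```php
<?php
// Shortened stand-in for the real page content (an assumption for this sketch).
$somehtml = '<p><a href="1.html" target="">Link 1</a> | <a href="2.html">Link 2</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

$p = $doc->getElementsByTagName('p')->item(0);

// Passing a node to saveHTML() serialises just that subtree,
// which is usually more readable than print_r() on a DOM object.
echo $doc->saveHTML($p), PHP_EOL;

// Attribute values are not part of the node's text; read them by name.
$hrefs = [];
foreach ($p->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```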

Could anyone recommend an understandable guide to DOMDocument? The PHP manual seems very short on practical examples.
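For the goal stated at the top (keep the byline and introductory text, drop the links, the empty span and the '|' separator), a rough sketch with DOMDocument might look like the following. The list of tags to remove and the handling of '|' are assumptions based only on the sample HTML, not a general solution:

```php
<?php
// Sample markup from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<p>
 <strong>Byline:</strong> Introductory text.

 <a href="1.html" target="">Link 1</a> |
 <span class="foo"></span>
 <a href="2.html">Link 2</a>
 <a href="3.html">Link 3</a>
</p>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // third-party HTML is often not valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$p = $doc->getElementsByTagName('p')->item(0);

// Tags to drop (an assumed list; adjust to the real page).
$unwanted = ['a', 'span'];

foreach ($unwanted as $tag) {
    // Copy the live node list to an array first: removing nodes while
    // iterating a live DOMNodeList would skip some of them.
    foreach (iterator_to_array($p->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

// The '|' lives in a text node, so it survives element removal.
// Strip it from the text and collapse the leftover whitespace.
$text = str_replace('|', '', $p->textContent);
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, PHP_EOL; // Byline: Introductory text.
```

If the cleaned markup is wanted rather than plain text, $doc->saveHTML($p) after the removal loop returns the paragraph with the remaining tags intact.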
