DOMDocument - extract a tag's textContent, but remove certain child elements first

Sample source HTML:

<p>
 <strong>Byline:</strong> Introductory text. 

 <a href="1.html" target="">Link 1</a> |
 <span class="foo"></span> 
 <a href="2.html">Link 2</a>
 <a href="3.html">Link 3</a>
</p>

What I'm trying to do:

I'd like to load the HTML in, get rid of the links and other extraneous tags (not a problem if I have to specify what they are), things like the '|' and so on, keeping the "Byline" and "Introductory text". This is a script that parses a 3rd-party site, so I've no ability to add CSS classes, etc.

I first attempted this with (not very widely used now) PHP Simple HTML DOM Parser, and more recently have been trying DOMDocument.

However I'm getting absolutely nowhere - e.g. right now I can't even traverse the tree underneath <p>:

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

$p = $doc->getElementsbyTagName('p');

foreach($p->childNodes as $item) {
  ...    
}

The above gives me an 'Undefined property: DOMNodeList::$childNodes' error on the foreach line.
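For reference, the error arises because getElementsByTagName() returns a DOMNodeList (a collection of matches) rather than a single element, and only individual nodes have childNodes. A minimal sketch of the traversal, assuming the first <p> in the document is the one wanted and using a shortened stand-in for $somehtml:

```php
<?php
// Shortened stand-in for the sample HTML (an assumption for illustration).
$somehtml = '<p><strong>Byline:</strong> Introductory text. '
          . '<a href="1.html">Link 1</a> | <a href="2.html">Link 2</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

// Pick one element out of the DOMNodeList before asking for its children.
$p = $doc->getElementsByTagName('p')->item(0);

$names = [];
foreach ($p->childNodes as $item) {
    // nodeName is "#text" for text nodes and the tag name for elements.
    $names[] = $item->nodeName;
}

print_r($names);
```

If the page contains several <p> elements, the same loop can be wrapped in an outer foreach over the DOMNodeList itself.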

Also, I find it frustrating that I apparently can't visualise the DOM using print_r, var_dump, etc. When I looped through the links using xpath->query, print_r showed me the link text but not the contents of href="". (XPath also seems inappropriate here: I don't really want to search for or extract specific things, but rather take the HTML, get rid of the nodes I don't want, and then save it.)
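One possible approach to the "remove what I don't want, then save" idea is sketched below. It assumes the unwanted tags are known in advance (here a and span, held in the hypothetical $unwanted list) and that text nodes consisting only of whitespace and '|' are separators to be dropped. Attribute values such as href are read with getAttribute(), and passing a node to saveHTML() serialises just that node, which is a practical substitute for print_r when inspecting the tree.

```php
<?php
$somehtml = <<<HTML
<p>
 <strong>Byline:</strong> Introductory text.

 <a href="1.html" target="">Link 1</a> |
 <span class="foo"></span>
 <a href="2.html">Link 2</a>
 <a href="3.html">Link 3</a>
</p>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($somehtml);
$p = $doc->getElementsByTagName('p')->item(0);

$unwanted = ['a', 'span']; // hypothetical list of tags to strip
$hrefs = [];

// childNodes is a live list, so copy it to an array before removing nodes.
foreach (iterator_to_array($p->childNodes) as $node) {
    if ($node instanceof DOMElement && in_array($node->nodeName, $unwanted, true)) {
        if ($node->hasAttribute('href')) {
            $hrefs[] = $node->getAttribute('href'); // attribute value, not link text
        }
        $p->removeChild($node);
    } elseif ($node instanceof DOMText && trim($node->nodeValue, " \n\r\t|") === '') {
        // Text node holding only whitespace and '|' separators.
        $p->removeChild($node);
    }
}

// Collapse runs of whitespace left over from the original formatting.
$text = trim(preg_replace('/\s+/', ' ', $p->textContent));

echo $text, "\n";              // Byline: Introductory text.
echo $doc->saveHTML($p), "\n"; // markup of the cleaned <p> only
```

This only looks at direct children of <p>; unwanted tags nested deeper would need a recursive walk or an XPath query instead.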

Could anyone recommend an understandable guide to DOMDocument? The PHP manual seems very short on practical examples.
