Sample source HTML:
<p>
<strong>Byline:</strong> Introductory text.
<a href="1.html" target="">Link 1</a> |
<span class="foo"></span>
<a href="2.html">Link 2</a>
<a href="3.html">Link 3</a>
</p>
What I'm trying to do:
I'd like to load the HTML, strip out the links and other extraneous tags (it's not a problem if I have to specify which ones) along with things like the '|' separator, and keep only the "Byline" and "Introductory text" parts. The script parses a 3rd-party site, so I have no way to add CSS classes, etc.
I first attempted this with PHP Simple HTML DOM Parser (not very widely used now), and more recently I have been trying DOMDocument.
However, I'm getting absolutely nowhere. For example, right now I can't even traverse the tree underneath <p>:
$doc = new DOMDocument();
$doc->loadHTML($somehtml);
$p = $doc->getElementsByTagName('p');
foreach($p->childNodes as $item) {
...
}
The above gives me an 'Undefined property: DOMNodeList::$childNodes' error for the foreach line.
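For reference, the error message is consistent with getElementsByTagName() returning a DOMNodeList (a list of matching elements) rather than a single element. A minimal sketch of the traversal, assuming only the first <p> matters and using a hypothetical $somehtml string that mirrors the sample above:

```php
<?php
// Hypothetical sample input mirroring the HTML at the top of the question.
$somehtml = '<p><strong>Byline:</strong> Introductory text. '
    . '<a href="1.html" target="">Link 1</a> | <span class="foo"></span> '
    . '<a href="2.html">Link 2</a> <a href="3.html">Link 3</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($somehtml);

// getElementsByTagName() returns a DOMNodeList, so pick one element
// out of the list before asking for its childNodes.
$p = $doc->getElementsByTagName('p')->item(0);

$names = [];
foreach ($p->childNodes as $item) {
    // nodeName is the tag name for elements and "#text" for text nodes.
    $names[] = $item->nodeName;
}

echo implode(', ', $names), "\n";
```

If the page contains several <p> elements, an outer foreach over the DOMNodeList itself would visit each one in turn.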
Also: I'm finding it frustrating that I apparently can't visualise the DOM using print_r, var_dump, etc. When I looped through the links using xpath->query, print_r showed me the link text but not the contents of href="". (xpath->query seems inappropriate here anyway, as I don't really want to search for or extract specific things; I'd rather take the HTML, get rid of the nodes I don't want and then save it.)
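To make the "remove what I don't want, then save" goal concrete, here is a rough sketch rather than a definitive approach. It assumes the unwanted tags are <a> and <span> (the $unwanted list is my own choice), treats text nodes containing only whitespace and '|' as separators, and reuses a hypothetical $somehtml string. It also shows getAttribute() for reading href and saveHTML($node) for serialising a node back to markup, which can stand in for print_r when inspecting the tree:

```php
<?php
// Hypothetical sample input mirroring the HTML at the top of the question.
$somehtml = '<p><strong>Byline:</strong> Introductory text. '
    . '<a href="1.html" target="">Link 1</a> | <span class="foo"></span> '
    . '<a href="2.html">Link 2</a> <a href="3.html">Link 3</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($somehtml);
$p = $doc->getElementsByTagName('p')->item(0);

// Assumption: these are the tags to discard.
$unwanted = ['a', 'span'];
$hrefs = [];

// childNodes is a live list, so take a snapshot before removing nodes.
foreach (iterator_to_array($p->childNodes) as $node) {
    if ($node instanceof DOMElement && in_array($node->nodeName, $unwanted, true)) {
        // Attribute values are read with getAttribute(), not via print_r().
        if ($node->hasAttribute('href')) {
            $hrefs[] = $node->getAttribute('href');
        }
        $p->removeChild($node);
    } elseif ($node instanceof DOMText && trim($node->nodeValue, " \t\r\n|") === '') {
        // Drop text nodes that hold nothing but whitespace and '|' separators.
        $p->removeChild($node);
    }
}

// saveHTML($node) serialises just that node and its remaining children.
$cleaned = $doc->saveHTML($p);

echo $cleaned, "\n";
print_r($hrefs);
```

Collecting the href values is only there to show how attributes are read; it can be dropped if the links are simply being discarded.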
Could anyone recommend an understandable guide to DOMDocument? The PHP manual seems very short on practical examples.