I'm parsing a third-party web page using PHP's DOMElement class. When I open the page in my browser and view the source, the markup is clean, but when I read some of the nodes through the DOMElement->nodeValue property, the HTML tags are missing, and the text contains several newlines and the character Â. According to this answer, that character shows up when there's an encoding issue.
I also get that gobbledygook using:
- simplexml_import_dom($node)->asXML();
- $doc->saveXML($node);
My question is: how can I get the clean HTML code inside the DOMElement?
Here is the clean HTML code:
<b>Author:</b> AUTHOR<br>
<b>ISBN:</b> 9780684857220 <br>
<b>Edition/Copyright:</b> 7<br>
<b>Publisher:</b> J+M<br>
<b>Published Date:</b> 1989<br>
Here is what nodeValue gives:
Â
Author:Â AUTHOR ISBN:Â 9780684857220 Edition/Copyright:Â 7 Publisher:Â J+M Published Date:Â
1989
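For reference, here is a minimal sketch that seems to reproduce the behaviour and shows one possible way to get the markup back. It rests on assumptions: the `$html` string and the `details` id are hypothetical stand-ins for the real page, and I am guessing that the page puts `&nbsp;` after each label. nodeValue returns only the concatenated text of the node, and a non-breaking space comes back as the UTF-8 bytes C2 A0, which show up as Â followed by a space when the output is displayed as ISO-8859-1. Serializing each child node with saveHTML() may be closer to what I want.

```php
<?php
// Hypothetical stand-in for the third-party page (the real markup is not shown above).
// Assumption: the page puts &nbsp; after each label.
$html = '<div id="details"><b>Author:</b>&nbsp;AUTHOR<br>'
      . '<b>ISBN:</b>&nbsp;9780684857220 <br></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the node of interest; the id "details" is made up for this sketch.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="details"]')->item(0);

// nodeValue is the concatenated text of all descendants, so the tags are gone.
$text = $node->nodeValue;

// The non-breaking space comes back as the UTF-8 bytes C2 A0.
// Displayed as ISO-8859-1, byte C2 renders as "Â".
$hasNbsp = strpos($text, "\xC2\xA0") !== false;

// Serializing each child with saveHTML() keeps the markup.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, PHP_EOL;
```

If this is right, sending the output as UTF-8 (for example with a `Content-Type: text/html; charset=utf-8` header) should also make the Â disappear, since the bytes themselves are valid UTF-8.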