I'm trying to write a script that parses a remote RSS feed and outputs the result as JSON.
The raw RSS feed contains HTML entities like –, …, etc.
I use html_entity_decode on the raw content first, so that json_encode will generate correct output:
$rss = new DOMDocument();
$rss->load('https://www.example.com/feed');

$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array(
        'title' => html_entity_decode($node->getElementsByTagName('title')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'desc'  => html_entity_decode($node->getElementsByTagName('description')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date'  => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
    );
    $feed[] = $item;
}

$data = array();
foreach ($feed as $item) {
    $data[] = array(
        'url'   => $item['link'],
        'date'  => date('l, F d, Y g:i A', strtotime($item['date'])),
        'title' => $item['title'],
        'desc'  => $item['desc'],
    );
}

echo json_encode($data);
It works well except for some HTML entities that are missing their trailing semicolons. html_entity_decode won't recognize them.
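Here is a minimal reproduction of the problem. The sample strings are made up for illustration and are not taken from the actual feed:

```php
<?php
// Illustrative sample strings only; the real feed content differs.
$ok     = html_entity_decode('A &ndash; B', ENT_COMPAT, 'UTF-8');
$broken = html_entity_decode('A &ndash B', ENT_COMPAT, 'UTF-8');

var_dump($ok === "A \u{2013} B");   // well-formed entity is decoded to an en dash
var_dump($broken === 'A &ndash B'); // entity without the semicolon is left untouched
```

Both checks print bool(true) for me, which is exactly the behaviour I'm trying to work around.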
I'm thinking I could use a regex to find and fix the entities that are missing semicolons, but I don't know how to write it properly. Any ideas?
Or is there any other way to deal with this?
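For reference, this is roughly the direction I have in mind for the regex idea. It is only a sketch: decode_loose_entities is a hypothetical helper name of my own (not a PHP built-in), and it assumes entity names consist only of letters and digits. Each candidate gets a semicolon appended and is passed to html_entity_decode; if nothing changes, the original text is kept, so something like "AT&T" should not be mangled.

```php
<?php
// Hypothetical helper (not part of PHP): decodes entities whether or not
// the trailing semicolon is present.
function decode_loose_entities(string $text): string
{
    // Match "&name", "&#123" or "&#x1F", each followed by an optional ";".
    $pattern = '/&(#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);?/i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Rebuild the candidate with a semicolon and try to decode it.
        $candidate = '&' . $m[1] . ';';
        $decoded   = html_entity_decode($candidate, ENT_COMPAT, 'UTF-8');

        // Unknown names stay exactly as they appeared in the input.
        return $decoded !== $candidate ? $decoded : $m[0];
    }, $text);
}

// Prints: Tom & Jerry – the end…
echo decode_loose_entities('Tom &amp Jerry &ndash the end&hellip;'), "\n";
```

One limitation I can already see: when an entity without a semicolon is immediately followed by letters or digits (for example "&ndashfoo"), the whole run is read as one unknown name and left undecoded.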