douguanyun2169 2016-09-28 18:42

PHP - Handling HTML entities that are missing semicolons

I'm trying to write a script to parse a remote RSS feed, and output the result in JSON format.

The raw RSS feed contains HTML entities like &#8211;, &#8230;, etc.

I use html_entity_decode on the raw content first, so that json_encode will generate correct output:

$rss = new DOMDocument();
$rss->load('https://www.example.com/feed');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array ( 
        'title' => html_entity_decode($node->getElementsByTagName('title')->item(0)->nodeValue,ENT_COMPAT,'UTF-8'),
        'desc' => html_entity_decode($node->getElementsByTagName('description')->item(0)->nodeValue,ENT_COMPAT,'UTF-8'),
        'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
    );
    $feed[] = $item;
}
$data = array();
foreach($feed as $item){
    $data[] = array('url'=>$item['link'],'date'=>date('l, F d, Y g:i A',strtotime($item['date'])),'title'=>$item['title'],'desc'=>$item['desc']);
}
echo json_encode($data);

It works well except for some HTML entities that are missing their trailing semicolons. html_entity_decode doesn't recognize those and leaves them in the output undecoded.
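A minimal reproduction of the behaviour described above, as a sketch only. The variable names are made up for illustration, and the en dash is written as its UTF-8 byte sequence so the comparison does not depend on the source file's encoding:

```php
<?php
// Numeric entity with its terminating semicolon: decoded to an en dash (U+2013).
$withSemicolon = html_entity_decode('&#8211;', ENT_COMPAT, 'UTF-8');

// The same entity without the semicolon: returned unchanged.
$withoutSemicolon = html_entity_decode('&#8211', ENT_COMPAT, 'UTF-8');

// "\xE2\x80\x93" is the UTF-8 encoding of U+2013.
var_dump($withSemicolon === "\xE2\x80\x93");
var_dump($withoutSemicolon === '&#8211');
```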

I'm thinking maybe I can use regex to find and fix those entities without semicolons. But I don't know how to write such code. Any idea?

Or is there any other way to deal with this?


  • doumei1955 2016-09-28 18:56

    It seems you just want to match &# followed by 4 digits that are not followed by a ;. Use

    '~&#\d{4}(?!;)~'
    

    and replace with $0; (the whole match plus a semicolon). See the regex demo.

    Details:

    • &# - literal sequence &#
    • \d{4} - 4 digits
    • (?!;) - a negative lookahead that fails the match if there is a ; right after the 4 digits.

    The $0 in the replacement pattern is the backreference to the whole match value.

    PHP snippet:

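    $re = '~&#\d{4}(?!;)~';
    // Two entities without a semicolon, followed by two well-formed ones
    $str = '&#8211&#8210&#8211;&#8211;';
    $subst = '$0;';
    $result = preg_replace($re, $subst, $str);
    echo $result; // &#8211;&#8210;&#8211;&#8211;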
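One caveat with the fixed \d{4}: an entity with more digits, such as &#12345;, would be matched after its first four digits and corrupted. A possible variant, sketched here under the assumption that only decimal numeric entities need repairing, uses a possessive \d++ so that the lookahead always tests the character after the last digit. The function names add_missing_semicolons and decode_feed_text are hypothetical helpers, not part of the original answer:

```php
<?php
// Append a semicolon to decimal numeric entities that lack one.
// \d++ is possessive: the engine cannot give digits back, so (?!;)
// is always checked against the character right after the last digit.
function add_missing_semicolons(string $text): string
{
    // preg_replace returns null on error; fall back to the input then.
    return preg_replace('~&#\d++(?!;)~', '$0;', $text) ?? $text;
}

// Repair the entities first, then decode them as in the question's code.
function decode_feed_text(string $text): string
{
    return html_entity_decode(add_missing_semicolons($text), ENT_COMPAT, 'UTF-8');
}

echo json_encode(decode_feed_text('2010 &#8211 2016 &#8230;')), "\n";
```

In the question's loop, decode_feed_text could stand in for the direct html_entity_decode calls on the title and description values. Named entities without a semicolon (for example &amp) and hex entities are not handled by this sketch.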
    
    This answer was accepted by the asker as the best answer.