dqyitt2954 2014-01-05 21:26
浏览 127
已采纳

PHP - html_entity_decode没有解码所有内容

I am parsing an HTML page. At some point I am getting the text between a div and using html_entity_decode to print that text.

The problem is that the page contains characters like this star or others like shapes like ⬛︎, ◄, ◉, etc. I have checked and these characters are not encoded on the source page, they are like you see them normally.

The page is using charset="UTF-8"

So, when I use

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

The star, for example, is "decoded" to â˜

$string is being obtained by using

document.getElementById("id-of-div").innerText

I would like to decode them correctly. How do I do that in PHP?

NOTE: I have tried htmlspecialchars_decode($string, ENT_QUOTES); and it produces the same result.

  • 写回答

1条回答 默认 最新

  • dongqian0763 2014-01-05 21:55
    关注

    I've tried to reproduce your issue with this simple bit of PHP:

    <?php
      // Make sure our client knows we're sending UTF-8
      header('Content-Type: text/plain; charset=utf-8');
      $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
      echo 'String: ' . $string . "
    ";
      echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    

    As expected, the output is:

    String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
    Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
    

    If I change the charset in the header to iso-8859-1, I see this:

    String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: <span>This is a &quot;test&quot;.
    Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: <span>This is a "test".
    

    So, I'd say that your issue is a display issue. The "interesting" characters are being left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you've got, or whatever you're using to look at your output, is using incorrectly using iso-8859-1 to display them.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 微信小程序协议怎么写
  • ¥15 c语言怎么用printf(“\b \b”)与getch()实现黑框里写入与删除?
  • ¥20 怎么用dlib库的算法识别小麦病虫害
  • ¥15 华为ensp模拟器中S5700交换机在配置过程中老是反复重启
  • ¥15 java写代码遇到问题,求帮助
  • ¥15 uniapp uview http 如何实现统一的请求异常信息提示?
  • ¥15 有了解d3和topogram.js库的吗?有偿请教
  • ¥100 任意维数的K均值聚类
  • ¥15 stamps做sbas-insar,时序沉降图怎么画
  • ¥15 买了个传感器,根据商家发的代码和步骤使用但是代码报错了不会改,有没有人可以看看