dqyitt2954 2014-01-05 21:26
Views: 127
Accepted

PHP - html_entity_decode is not decoding everything

I am parsing an HTML page. At some point I am getting the text between a div and using html_entity_decode to print that text.

The problem is that the page contains characters like this star ★ and other shapes such as ⬛︎, ◄, ◉, etc. I have checked, and these characters are not encoded as entities in the page source; they appear there as literal characters.

The page is using charset="UTF-8"

So, when I use

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

The star, for example, is "decoded" to â˜

$string is being obtained by using

document.getElementById("id-of-div").innerText

I would like to decode them correctly. How do I do that in PHP?

NOTE: I have tried htmlspecialchars_decode($string, ENT_QUOTES); and it produces the same result.

1 answer

  • dongqian0763 2014-01-05 21:55

    I've tried to reproduce your issue with this simple bit of PHP:

    <?php
      // Make sure our client knows we're sending UTF-8
      header('Content-Type: text/plain; charset=utf-8');
      $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
      echo 'String: ' . $string . "\n";
      echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    

    As expected, the output is:

    String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
    Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
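
One way to confirm this without relying on how a terminal or browser renders the text is to compare the strings directly. The following is a minimal sketch, assuming the file is saved as UTF-8; the shortened sample string and the variable names are made up for illustration and are not from the original test.

```php
<?php
// Hypothetical check: literal UTF-8 characters should pass through unchanged,
// while only the &quot; entities are rewritten.
$string  = "star ★, shapes ◄ ◉, and a &quot;test&quot;.";
$decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// Build the expected result by hand: replace the entity, touch nothing else.
$expected = str_replace('&quot;', '"', $string);

// True if html_entity_decode changed nothing except the entities.
$onlyEntitiesChanged = ($decoded === $expected);

// True if the star is still present as its three UTF-8 bytes E2 98 85.
$starBytesIntact = strpos($decoded, "\xE2\x98\x85") !== false;

var_dump($onlyEntitiesChanged);
var_dump($starBytesIntact);
```

If both checks report true, the decoding step is not what alters the special characters.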
    

    If I change the charset in the header to iso-8859-1, I see this:

    String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
    Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
    

    So, I'd say that your issue is a display issue. The "interesting" characters are being left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you've got, or whatever you're using to look at your output, is incorrectly using iso-8859-1 to display them.
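
The misreading can be simulated in PHP itself. This is only a sketch, under two assumptions: the mbstring extension is available, and the viewer behaves like most browsers, which treat a declared iso-8859-1 as Windows-1252. It takes the UTF-8 bytes of the star and re-reads each byte as a separate single-byte character, which is what a wrongly configured viewer effectively does.

```php
<?php
// The star is stored as three UTF-8 bytes: E2 98 85.
$star      = "★";
$starBytes = bin2hex($star);

// Simulate a viewer that wrongly assumes a single-byte charset:
// treat each byte as a Windows-1252 character and convert it to UTF-8.
$misread = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');

// One character has turned into three.
$misreadLength = mb_strlen($misread, 'UTF-8');

echo $starBytes, "\n";
echo $misread, "\n";
echo $misreadLength, "\n";
```

The misread string begins with â, matching the garbled output in the question, which suggests the fix belongs wherever the output charset is declared (for example the Content-Type header) rather than in the decoding call.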
