2018-01-25 16:09


I have a PHP script that uses cURL to fetch the title and description of a user-entered URL and display them on a page (which includes a UTF-8 charset meta tag), but some characters are not displaying correctly.

I read in this answer that the PHP cURL functions encode strings as UTF-8 and that I need to decode them with utf8_decode. But I'm finding that utf8_decode is hit or miss: sometimes it helps, and sometimes it produces unknown characters that were not in the string before it was decoded.

I've included some examples below.

What's the proper way to handle encoding in this case?


Here's the content fetched from a NY Times article with an emdash in the description. In this case, the decoded version displays the character properly:

[screenshot]

Here's content from another NY Times article with an emdash in the description, and here, decoding made the character display improperly:

[screenshot]

I'm finding that decoding causes problems with foreign language sites like this one in Spanish:

[screenshot]

I know I can detect the language of the page at the URL and decide whether to decode based on that, but I'm finding plenty of English-language sites where encoding causes problems, like this one:

[screenshot]



2 answers

  • dtml3340 2018-01-26 19:12

    After doing a lot more experimenting I stumbled on this solution, which fixed everything.

    My script fetched the URL contents and loaded them into a DOM document like this:

    $html = file_get_contents_curl($link_url);
    $doc = new DOMDocument();

    Per the linked article, I changed it to this:

    $html = file_get_contents_curl($link_url);
    $doc = new DOMDocument();
    @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    I also eliminated the use of utf8_decode.

    And everything displayed properly.
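
A minimal sketch of the same idea, assuming the fetched HTML is already UTF-8 and the mbstring extension is loaded. The function name `extract_title_and_description` and the sample markup are hypothetical, not part of the original script. It uses `mb_encode_numericentity` in place of the `'HTML-ENTITIES'` conversion, because that pseudo-encoding is deprecated as of PHP 8.2, and `libxml_use_internal_errors` in place of `@`:

```php
<?php
// Hypothetical helper (not from the original answer): extracts the <title>
// and the meta description from an HTML string assumed to be UTF-8.
function extract_title_and_description(string $html): array
{
    // Turn every non-ASCII character into a numeric entity so the parser
    // never has to guess the encoding of the input bytes.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = '';
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $title = trim($titles->item(0)->textContent);
    }

    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }

    // DOM returns strings as UTF-8, so no utf8_decode is applied here.
    return ['title' => $title, 'description' => $description];
}

// Sample input containing an em dash (U+2014) and accented characters.
$sample = "<html><head><title>Caf\u{00E9} \u{2014} News</title>"
        . "<meta name=\"description\" content=\"Ma\u{00F1}ana \u{2014} hoy\">"
        . "</head><body></body></html>";
$result = extract_title_and_description($sample);
```

The values in `$result` are UTF-8 strings, so they can be echoed directly into a page that declares a UTF-8 charset (after escaping with `htmlspecialchars`).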

  • doutangshuan6473 2018-01-25 22:44

    The server declares the page encoding, and you have to decode according to that. You can find out the encoding in advance by issuing a HEAD request and looking for charset in the Content-Type header:

    curl --head
    HTTP/1.1 200 OK
    Server: Apache
    Cache-Control: no-cache
    X-ESI: 1
    X-App-Response-Time: 0.70
    Content-Type: text/html; charset=utf-8
    X-PageType: homepage
    ...
    Vary: Accept-Encoding, Fastly-SSL
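
The header-based approach could be sketched in PHP roughly as follows. The helper names `charset_from_content_type` and `body_to_utf8` are hypothetical, and the sketch assumes the declared charset is one that mbstring recognizes and that the header is truthful about the body. Instead of a separate HEAD request, the Content-Type value can also be read from the same cURL handle after the fetch with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`:

```php
<?php
// Hypothetical helper: pulls the charset parameter out of a Content-Type
// header value such as "text/html; charset=utf-8". Returns null if absent.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: converts a fetched body to UTF-8 based on the header.
// Assumption: when no charset is declared, the body is treated as $fallback.
function body_to_utf8(string $body, string $contentType, string $fallback = 'UTF-8'): string
{
    $charset = charset_from_content_type($contentType) ?? $fallback;
    if ($charset === 'UTF-8') {
        return $body; // already UTF-8, nothing to convert
    }
    // Assumes mbstring knows this charset name; unknown names raise an error.
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Example: 0xE9 is "é" in ISO-8859-1; after conversion it is valid UTF-8.
$utf8 = body_to_utf8("Caf\xE9", 'text/html; charset=ISO-8859-1');
```

Converting once to UTF-8 based on the declared charset avoids calling utf8_decode, which only makes sense when the target page is ISO-8859-1.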
