dpzbzp8728 2011-03-10 03:31
浏览 52
已采纳

PHP - 如何检测编码?

I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."

In order to prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).

I've tried something like this:

mb_detect_encoding($amazon_description);

It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.

Any suggestions what I need to do?

EDIT:

This is my current solution:

$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);

I'm not sure about this as this may delete stuff that I should be keeping.

  • 写回答

1条回答 默认 最新

  • doukekui0914 2011-03-10 08:49
    关注

    Can you provide your tidy repairString call?

    If you tried to use input-encoding and output-encoding from tidy options, try to not use these and use the third argument or repairString instead, something like this :

    $oTidy = new tidy();
    $page_content = $oTidy->repairString($page_content,
        array("show-errors" => 0, "show-warnings" => false),
        "utf8"
    );
    

    Edit :

    After doing some tests, what I said before cannot work if you don't have utf8 encoding in $page_content already before calling repairString

    But you will mostly end up with ISO-8859-1 (latin1) encoding if not UTF-8 already.

    May I suggest you try :

    $charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1');
    if ($charset == "ISO-8859-1") {
        $amazon_description = utf8_encode($amazon_description);
    }
    $oTidy = new tidy();
    $amazon_description = $oTidy->repairString($amazon_description,
        array("show-errors" => 0, "show-warnings" => false),
        "utf8"
    );
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 虚幻5 UE美术毛发渲染
  • ¥15 CVRP 图论 物流运输优化
  • ¥15 Tableau online 嵌入ppt失败
  • ¥100 支付宝网页转账系统不识别账号
  • ¥15 基于单片机的靶位控制系统
  • ¥15 真我手机蓝牙传输进度消息被关闭了,怎么打开?(关键词-消息通知)
  • ¥15 装 pytorch 的时候出了好多问题,遇到这种情况怎么处理?
  • ¥20 IOS游览器某宝手机网页版自动立即购买JavaScript脚本
  • ¥15 手机接入宽带网线,如何释放宽带全部速度
  • ¥30 关于#r语言#的问题:如何对R语言中mfgarch包中构建的garch-midas模型进行样本内长期波动率预测和样本外长期波动率预测