dougong5817 2010-11-19 10:32
浏览 52
已采纳

PHP:使用过滤器删除XML中的无效utf-8字符

I have a large file, so I have created a filter for removing invalid utf-8 characters from XML.

class ValidUTF8XMLFilter extends php_user_filter {

    protected static $pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = preg_replace(self::$pattern, '$1', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

This filter will remove also utf-8 characters not only invalid in xml, but also in utf-8. The regex is taken from Multilingual form encoding. The class was taken from this answer: How to skip invalid characters in XML file using PHP and rewritten. The pattern in that answer won't work for invalid utf-8 characters, eg. 0x1D.

Will this filter work, in situation, where invalid bytes starts at the end of buffer and ends in beginning of next filtering? Is this situation possible?

  • 写回答

1条回答 默认 最新

  • doujia7517 2010-11-19 11:21
    关注

    No, I don't think it will work. It will strip valid sequences of code units that happen to be split between several buckets.

    It should not consume potentially incomplete sequences in the end (and, if necessary, it should pass nothing and return PSFS_FEED_ME).

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 想通过pywinauto自动电机应用程序按钮,但是找不到应用程序按钮信息
  • ¥15 MATLAB中streamslice问题
  • ¥15 如何在炒股软件中,爬到我想看的日k线
  • ¥15 51单片机中C语言怎么做到下面类似的功能的函数(相关搜索:c语言)
  • ¥15 seatunnel 怎么配置Elasticsearch
  • ¥15 PSCAD安装问题 ERROR: Visual Studio 2013, 2015, 2017 or 2019 is not found in the system.
  • ¥15 (标签-MATLAB|关键词-多址)
  • ¥15 关于#MATLAB#的问题,如何解决?(相关搜索:信噪比,系统容量)
  • ¥500 52810做蓝牙接受端
  • ¥15 基于PLC的三轴机械手程序