douqing5981 2015-12-08 08:26
Views: 631
Accepted

Converting CESU-8 to UTF-8 with high performance

I have some raw text that is usually a valid UTF-8 string. However, every now and then the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but since it happens rarely, I would rather not spend a lot of CPU time on it.

Is there a fast way to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert "UTF-8" to UTF-16LE and then back to UTF-8 using iconv(), and I would probably get the correct result every time because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I expect the input to be CESU-8 rather than valid UTF-8 in roughly 0.01-0.1% of all strings.)

(CESU-8 is a non-standard string format in which 16-bit surrogate pairs are encoded as if they were UTF-8 characters. Technically, a UTF-8 string should contain the characters represented by those surrogate pairs, not the surrogate pairs themselves.)
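
To make the difference concrete, here is a small sketch that prints the same character, U+1F600, in both encodings. The helper names `encodeUnit3` and `cesu8Encode` are made up for this illustration and are not part of any library.

```php
<?php
// Encode one 16-bit code unit as a three-byte UTF-8-style sequence.
function encodeUnit3(int $u): string {
    return chr(0xE0 | ($u >> 12))
         . chr(0x80 | (($u >> 6) & 0x3F))
         . chr(0x80 | ($u & 0x3F));
}

// Hypothetical helper: CESU-8 bytes for a code point above U+FFFF.
function cesu8Encode(int $cp): string {
    $v  = $cp - 0x10000;
    $hi = 0xD800 | ($v >> 10);    // high surrogate
    $lo = 0xDC00 | ($v & 0x3FF);  // low surrogate
    return encodeUnit3($hi) . encodeUnit3($lo);
}

$utf8  = "\xF0\x9F\x98\x80";      // U+1F600 as valid UTF-8 (4 bytes)
$cesu8 = cesu8Encode(0x1F600);    // the same character as CESU-8 (6 bytes)

echo bin2hex($utf8), "\n";        // f09f9880
echo bin2hex($cesu8), "\n";       // eda0bdedb880
```

Only characters above U+FFFF differ between the two encodings, so for everything else a CESU-8 string is byte-for-byte identical to UTF-8.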

3 answers

  • doujia7094 2015-12-13 15:37

    Here's a more efficient version of your conversion function:

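    // $s holds the input bytes. Match a CESU-8 high surrogate (ED A0-AF xx)
    // immediately followed by a low surrogate (ED B0-BF xx).
    $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
    $s = preg_replace_callback($regex, function($m) {
        // unpack() returns a 1-based array of the six matched bytes.
        $in = unpack("C*", $m[0]);
        $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
        return pack("C*",
            0xF0 | (($in[2] & 0x1C) >> 2),
            0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
            0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
            $in[6]
        );
    }, $s);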
    

    The code only converts a high surrogate that is immediately followed by a low surrogate, and it turns the two three-byte CESU-8 sequences directly into one four-byte UTF-8 sequence, i.e. from

    ED       A0-AF    80-BF    ED       B0-BF    80-BF
    11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd
    

    to

    F0-F4    80-BF    80-BF    80-BF
    11110oaa 10aabbbb 10bbcccc 10dddddd    // o is the "overflow" bit produced by adding 1 to aaaa
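
    As a cross-check of the bit layout, the sketch below takes the CESU-8 bytes of U+1F600 and does the conversion the long way: it extracts the a/b/c/d fields, rebuilds the code point, and then applies the ordinary four-byte UTF-8 encoding. The variable names are chosen for this illustration only.

    ```php
    <?php
    // CESU-8 bytes for U+1F600 (surrogates D83D DE00, three bytes each).
    $cesu8 = "\xED\xA0\xBD\xED\xB8\x80";

    // unpack("C*") returns a 1-based array of byte values.
    $in = unpack("C*", $cesu8);

    // Payload fields as named in the diagram.
    $a = $in[2] & 0x0F;   // aaaa
    $b = $in[3] & 0x3F;   // bbbbbb
    $c = $in[5] & 0x0F;   // cccc
    $d = $in[6] & 0x3F;   // dddddd

    // Adding 0x10000 here corresponds to the "+1 on aaaa" step.
    $cp = 0x10000 + (($a << 16) | ($b << 10) | ($c << 6) | $d);

    // Ordinary four-byte UTF-8 encoding of the code point.
    $utf8 = pack("C*",
        0xF0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F)
    );

    printf("U+%X\n", $cp);       // U+1F600
    echo bin2hex($utf8), "\n";   // f09f9880
    ```

    The callback above produces the same four bytes without ever materialising the code point, which is why it needs only a handful of shifts and masks.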
    

    Here's an online example.
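
    On the detection part of the question, one possible shortcut is to look for the byte 0xED before running the regex, since every encoded surrogate starts with it. This is only a pre-filter: valid UTF-8 for U+D000 to U+D7FF also starts with 0xED, so the regex still decides. The wrapper name `fixCesu8` is hypothetical, and whether the extra `strpos` call is measurably faster than calling `preg_replace_callback` directly would need benchmarking on real input.

    ```php
    <?php
    // Hypothetical wrapper around the conversion shown above.
    function fixCesu8(string $s): string {
        // Fast path: without a 0xED byte there cannot be an encoded surrogate.
        if (strpos($s, "\xED") === false) {
            return $s;
        }
        $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
        $out = preg_replace_callback($regex, function ($m) {
            $in = unpack("C*", $m[0]);
            $in[2] += 1; // adds 0x10000 to the code point
            return pack("C*",
                0xF0 | (($in[2] & 0x1C) >> 2),
                0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
                0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
                $in[6]
            );
        }, $s);
        // preg_replace_callback returns null on error; fall back to the input.
        return $out ?? $s;
    }

    echo bin2hex(fixCesu8("\xED\xA0\xBD\xED\xB8\x80")), "\n"; // f09f9880
    echo bin2hex(fixCesu8("\xED\x95\x9C")), "\n";             // ed959c (U+D55C, left alone)
    ```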
