doushang1778 2019-02-16 09:30
Viewed 96 times

Convert Unicode code points to strings in an entire file

I am running a PHP web application which accepts a file from the user, appends some data to it, and provides the user a new file to download.

Occasionally I get files which contain invisible control characters such as a BOM, zero-width no-break space, etc. (a plain-text editor does not show them, but when checked with the 'less' command or in the 'vi' editor they appear as <U+200F>, <U+FEFF>, <U+0083>, etc.), and that causes issues with our processing. Currently I have a list of a few such code points which I remove from the file using 'sed' before processing it (below is the command I use). I then also use 'iconv' to convert non-UTF files to UTF-8.

exec("sed -i -E 's/\xE2\x80\x8F|\xC2\x81|\xE2\x80\x8B|\xE2\x80\x8E|\xEF\xBB\xBF|\xC2\xAD|\xC2\x89|\xC2\x83|\xC2\x87|\xC2\x82//g' 'my_file_path'");
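
For context, the iconv step mentioned above looks roughly like the following sketch (the encoding detection via file -b --mime-encoding and the temp-file handling are illustrative assumptions, not my exact commands):

    <?php
    // Sketch of the iconv normalization step. Assumed details:
    // source-encoding detection via `file`, and in-place replacement
    // through a temporary file.
    $path = 'my_file_path';
    $from = trim(shell_exec('file -b --mime-encoding ' . escapeshellarg($path)));
    if ($from !== 'utf-8' && $from !== 'us-ascii') {
        exec('iconv -f ' . escapeshellarg($from) . ' -t utf-8 -o '
            . escapeshellarg($path . '.utf8') . ' ' . escapeshellarg($path));
        rename($path . '.utf8', $path);
    }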

But the list of such characters keeps growing, and when they are not handled properly they cause the file encoding to be detected as 'unknown-8bit', which is not correct and shows up as corrupted content. Now I need a solution which is efficient and does not require me to look up the code table.

How should I do this so that every such code point in the file is handled automatically and I don't need to maintain a list of codes to replace? I am also open to a Perl/Python/Bash script solution.

P.S. I need to support all languages (not just US-ASCII or extended ASCII), and I also don't want any data loss.
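
One direction I am considering, shown as a minimal sketch below: instead of enumerating code points, match them by Unicode character class. The characters listed above are all Unicode 'format' characters (\p{Cf}: BOM, ZWSP, RLM, soft hyphen, ...) or stray C1 controls (U+0080-U+009F), so a property-based regex covers them without a lookup table. Assumptions: the file is already valid UTF-8 (i.e. the iconv step has run first), and dropping Cf/C1 characters counts as acceptable for "no data loss" in my case.

    <?php
    // Sketch: strip Unicode "format" characters and C1 controls by
    // Unicode property, so no per-code-point list has to be maintained.
    // Assumes the file is already valid UTF-8 (the 'u' modifier requires it).
    $path = 'my_file_path';
    $text = file_get_contents($path);
    $clean = preg_replace('/[\p{Cf}\x{0080}-\x{009F}]+/u', '', $text);
    file_put_contents($path, $clean);

One caveat with this sketch: \p{Cf} also covers ZWJ/ZWNJ (U+200D/U+200C), which Arabic and Indic text (and emoji sequences) can legitimately use, so those might need to be excluded from the class. For a quick test from the shell, an equivalent Perl one-liner would be something like perl -CSD -i -pe 's/[\p{Cf}\x{80}-\x{9F}]//g' my_file_path (again assuming UTF-8 input).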
