Convert Unicode code points to strings across an entire file

I am running a PHP web application that accepts a file from the user, appends some data to it, and provides the user with a new file to download.

Occasionally I get files that contain invisible control characters such as the BOM or zero-width no-break space. A plain text editor does not show them, but the 'less' command or the 'vi' editor displays them as <U+200F>, <U+FEFF>, <U+0083>, etc., and they break our processing. Currently I have a list of a few such code points, which I remove from the file with 'sed' before processing it (the command I use is below). I then use "iconv" to convert non-UTF-8 files to UTF-8.

exec("sed -E -i 's/\xE2\x80\x8F|\xC2\x81|\xE2\x80\x8B|\xE2\x80\x8E|\xEF\xBB\xBF|\xC2\xAD|\xC2\x89|\xC2\x83|\xC2\x87|\xC2\x82//g' 'my_file_path'");

But the list of such characters keeps growing, and when they are not handled properly the file encoding is detected as 'unknown-8bit', which is wrong and shows corrupted content. I am now looking for a solution that is efficient and does not require me to look up the code table.

How can I do this so that every such code point in the file is handled automatically, without maintaining a list of codes to replace? I am also open to a Perl, Python, or bash script solution.

P.S. I need to support all languages (not just US-ASCII or extended ASCII), and I do not want any data loss.
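
One possible direction, sketched in Python since a script solution is acceptable: instead of listing code points, filter by Unicode general category. Every character in the sed command above belongs to category Cc (control) or Cf (format), so dropping those two categories covers the current list without a lookup table. The names strip_invisible, clean_file, and KEEP are hypothetical, and the choice to keep ZWNJ (U+200C) and ZWJ (U+200D) is an assumption made because some scripts and emoji sequences depend on them; this is an illustration under those assumptions, not a vetted solution.

```python
import unicodedata

# Assumption: characters to keep even though they are Cc/Cf.
# Newline, carriage return and tab are ordinary text layout;
# ZWNJ (U+200C) and ZWJ (U+200D) carry meaning in some scripts and emoji.
KEEP = {"\n", "\r", "\t", "\u200c", "\u200d"}


def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def clean_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Read src in the given encoding, write a filtered UTF-8 copy to dst."""
    # newline="" leaves the original line endings untouched.
    with open(src, encoding=encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(strip_invisible(text))
```

Decoding happens before filtering, so the source encoding still has to be known or detected first, as with the existing "iconv" step. If marks such as U+200E or U+200F turn out to matter for some right-to-left content, they can be added to KEEP.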
