Lotus@ asked 2010-11-16 20:50

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can't.

So it's been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead, you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.
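In Python, for instance, that in-file declaration is the PEP 263 coding comment on the first or second line (shown here purely as an illustration):

    # -*- coding: utf-8 -*-
    # PEP 263 declaration: tells the Python interpreter (and most editors)
    # that this source file is encoded in UTF-8.
    greeting = "naïve café"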

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in—and preferably more than that. In particular, it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English-language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMed Central’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding—Microsoft CP1252, I believe. It takes quite a while before you trip on that one. :(

Reposted from: https://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and

7 answers

  • csdnceshi62 2010-11-17 01:38

    First, the easy cases:

    ASCII

    If your data contains no bytes above 0x7F, then it's ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)
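    A minimal sketch of that check in Python (one of the languages on your list):

        # A file is plain ASCII if and only if every byte is below 0x80.
        def is_ascii(data: bytes) -> bool:
            return all(b < 0x80 for b in data)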

    UTF-8

    If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8's strict validation rules, false positives are extremely rare.
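    In Python, for instance, strict decoding already performs that validation; a sketch:

        # Valid UTF-8 decodes cleanly; anything else raises UnicodeDecodeError.
        def is_utf8(data: bytes) -> bool:
            try:
                data.decode('utf-8')
                return True
            except UnicodeDecodeError:
                return False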

    ISO-8859-1 vs. windows-1252

    The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I've seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don't even bother with them, or with ISO-8859-1; just detect windows-1252 instead.

    That now leaves you with only one question.

    How do you distinguish MacRoman from cp1252?

    This is a lot trickier.

    Undefined characters

    The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.
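    As a sketch, that test is just set membership over the raw bytes:

        # These byte values are left undefined by windows-1252 but are
        # assigned printable characters in MacRoman.
        CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

        def looks_like_macroman(data: bytes) -> bool:
            return any(b in CP1252_UNDEFINED for b in data)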

    Identical characters

    The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn't matter whether you choose MacRoman or cp1252.

    Statistical approach

    Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.
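    A sketch of that counting step, assuming Python's built-in cp1252 and mac_roman codecs and a corpus you already trust to be UTF-8:

        from collections import Counter

        def hint_bytes(utf8_corpus: bytes, codec: str, top_n: int = 10) -> set:
            # Map the corpus's most frequent non-ASCII characters to their
            # single-byte values in the given 8-bit codec.
            text = utf8_corpus.decode('utf-8')
            common = Counter(ch for ch in text if ord(ch) > 0x7F).most_common(top_n)
            hints = set()
            for ch, _count in common:
                try:
                    hints.add(ch.encode(codec)[0])
                except UnicodeEncodeError:
                    pass   # no single-byte equivalent in this codec
            return hints

        # e.g. hint_bytes(corpus, 'cp1252') and hint_bytes(corpus, 'mac_roman')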

    For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

    • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
    • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

    Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
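    Putting the pieces together, a rough sketch of the whole decision procedure (the byte sets are the ones listed above; recount them against your own corpus for better results):

        CP1252_HINTS   = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
        MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}
        CP1252_UNUSED  = {0x81, 0x8D, 0x8F, 0x90, 0x9D}   # undefined in windows-1252

        def guess_encoding(data: bytes) -> str:
            if all(b < 0x80 for b in data):
                return 'ascii'
            try:
                data.decode('utf-8')
                return 'utf-8'
            except UnicodeDecodeError:
                pass
            # High bytes present and not valid UTF-8: it is one of the 8-bit encodings.
            if any(b in CP1252_UNUSED for b in data):
                return 'mac-roman'
            cp1252_score   = sum(b in CP1252_HINTS   for b in data)
            macroman_score = sum(b in MACROMAN_HINTS for b in data)
            return 'windows-1252' if cp1252_score >= macroman_score else 'mac-roman'

    Feed it the raw bytes, e.g. guess_encoding(open(path, 'rb').read()), and treat the result as a guess to spot-check, not a guarantee.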

    This answer was selected by the asker as the best answer.
