dssnh86244 2014-05-05 08:39 Acceptance rate: 100%
Views: 20
Accepted

Half a character? - Accent encoding issue

I'm currently facing a very strange encoding issue while processing some HTML source. It contains the following line:

"requête présentée par..."

When an external library runs utf8_decode on it, I get:

"reque^te présente´e par..."

So the accents end up to the right of the characters they belong to. If I run utf8_encode on that result, I don't get the original "requête présentée par..." back; I still get "reque^te présente´e par...".

Even stranger: if I open the original HTML in Notepad++, the encoding is reported as UTF-8 without BOM (so far, so good), but I can actually select half of a character with the text selection (keyboard or mouse). Yes, half of it. It is as if the real content were "e^" but it was displayed as "ê". When I copy it to my IDE, it copies "ê" but pastes "e^".

I have come up with a basic replacement function:

"e^" => "ê", "e´" => "é", ...

plus some other French cases, and it works properly for now. But as the HTML comes in different languages, I'm pretty sure I won't be able to replace every character affected by this encoding issue.

Has anybody faced this issue before and (hopefully) found a more general solution?

Thanks in advance.

1 answer

  • doyp9057 2014-05-05 09:01

    It sounds like your HTML source is using combining characters. That is, instead of a single Unicode code point for ê, it contains a regular e followed by a combining character that adds the circumflex. You can verify this with a hex editor: the combining circumflex accent is code point U+0302, which appears as the bytes CC 82 in a UTF-8 file.
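
    If a hex editor is not at hand, the same check can be done in a few lines of code. The following is a minimal sketch in Python (not the asker's PHP); the helper names `describe` and `utf8_hex` and the sample string are hypothetical and only stand in for text read from the real HTML file.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex, as a hex editor shows them."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

# Hypothetical sample: "e" followed by the combining circumflex U+0302,
# which is what the answer suspects the HTML source contains.
sample = "reque\u0302te"

for code, name in describe(sample):
    print(code, name)

# Bytes for the two-code-point sequence "e" + U+0302.
print(utf8_hex("e\u0302"))
```

    If the suspicion is right, the listing shows LATIN SMALL LETTER E immediately followed by COMBINING CIRCUMFLEX ACCENT, rather than a single LATIN SMALL LETTER E WITH CIRCUMFLEX.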

    See also Unicode equivalence.
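
    One general way to act on this, instead of a hand-written replacement table, is to normalize the text to the composed form (NFC) while it is still UTF-8, before any conversion to a single-byte encoding. Below is a minimal sketch in Python using the standard `unicodedata` module; the helper name `to_nfc` and the sample string are hypothetical. In PHP, the intl extension's `Normalizer::normalize()` with `Normalizer::FORM_C` should serve the same purpose, assuming that extension is installed.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters and combining marks into single code points where possible."""
    return unicodedata.normalize("NFC", text)

# Hypothetical sample in decomposed form: each accented letter is stored as
# a plain "e" followed by a combining mark (U+0302 circumflex, U+0301 acute).
decomposed = "reque\u0302te pre\u0301sente\u0301e"
composed = to_nfc(decomposed)

# The composed string is shorter because each letter+mark pair became one code point.
print(len(decomposed), len(composed))

# It now equals the precomposed spelling, written here with explicit escapes.
print(composed == "requ\u00eate pr\u00e9sent\u00e9e")
```

    Because this works on Unicode's own composition rules, it is not limited to the French cases; sequences that have no precomposed equivalent are simply left as they are.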

    This answer was selected as the best answer by the asker.
