duanfu2562 2014-01-18 13:39
浏览 32
已采纳

如何计算字符串中Unicode字符的出现次数?

how do you count the occurrences of a Unicode character in a string with PHP?
maybe this is a simple questions but I am a biginner in PHP. I want to count how many Unicode characters U+06cc are in a string.

Character 'yeh' in farsi corresponds to 2 code points.
ی = u+06cc
ي = u+064a
that u+064a is a substitute in Farsi.
The popular character Arabic charset CP-1256 has no character mapped into U+06cc.
now I want to count how many Unicode characters U+06cc are in a string to detect that string is arabic or farsi.
when I use $count = substr_count($str, "ى");
or when I use
$count = substr_count($str, "\xDB\x8c");
it counts both "ی" and "ي" ,
any idea ?

  • 写回答

2条回答 默认 最新

  • duangonglian6028 2014-01-18 13:51
    关注

    I suppose you have a UTF-8 string, since UTF-8 is the most reasonable Unicode encoding.

    $count = substr_count($str, "\xDB\x8C");
    

    is what you want. You simply treat the string as a sequence of bytes. In UTF-8 the first byte of a multibyte character and its continuation bytes can never be mixed up (the first byte is always 11...... binary, while continuation bytes are always 10......). This ensures you cannot find something different from what your are looking for.

    To find the UTF-8 encoding of U+06CC I used the fileformat.info website, which I think is the best for this purpose.

    If you use UTF-8 in your IDE too, you can simply write "ى" instead of "\xDB\x8C" (internally they are exactly the same string in PHP), but that will make the readability of what you have written dependent on the IDE (often not good if you need to share your code).


    Now that you have clarified your question, my above answer is no more appropriate. I leave it there just as a reference for other passers-by.

    Your problem could stem from the fact that, reading here it seems that "ي" can lose its dots below if modified by the Unicode character U+0654 (the non-spacing mark "Arabic hamsa above"). Since my browser does not remove the dots, and adds the hamsa, I don't know whether the hamsa is supposed to disappear too when the dots disappear. Anyway, it COULD be that "\xDB\x8C" has the same appearance as "\xD9\x8A\xD9\x94". I have not been able to find the reverse, i.e., the double dot below as a non-spacing modification character, which would explain why substr_count($str, "\xDB\x8c") finds the Arabic yeh too - but maybe it exists.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 如何让企业微信机器人实现消息汇总整合
  • ¥50 关于#ui#的问题:做yolov8的ui界面出现的问题
  • ¥15 如何用Python爬取各高校教师公开的教育和工作经历
  • ¥15 TLE9879QXA40 电机驱动
  • ¥20 对于工程问题的非线性数学模型进行线性化
  • ¥15 Mirare PLUS 进行密钥认证?(详解)
  • ¥15 物体双站RCS和其组成阵列后的双站RCS关系验证
  • ¥20 想用ollama做一个自己的AI数据库
  • ¥15 关于qualoth编辑及缝合服装领子的问题解决方案探寻
  • ¥15 请问怎么才能复现这样的图呀