douxiong3245 2017-04-22 16:25

PHP: splitting a string into n-gram chunks breaks on Unicode characters

I am trying to generate n-grams from a string in PHP. For that I use this function, taken from https://gist.github.com/Xeoncross/5366393:

function Bigrams($word){
    $ngrams = array();
    $len = strlen($word);
    for($i=0;$i+1<$len;$i++){
        $ngrams[$i]=$word[$i].$word[$i+1];
    }
    return $ngrams;
}

$word = "abcdefg";

print_r(Bigrams($word));

That works and returns the expected n-grams:

[0] => ab
[1] => bc
[2] => cd
[3] => de
[4] => ef
[5] => fg

But for strings containing certain Unicode characters, the result is not what I expect.

For example, $word = "Lòria" returns:

[0] => L�
[1] => ò
[2] => �r
[3] => ri

And $word = "пожалуйста" returns:

[0] => п
[1] => ��
[2] => о
[3] => ��
[4] => ж
[5] => ��
[6] => а
[7] => ��
[8] => л

Any idea how to solve this?

1 answer

  • douanrang4728 2017-04-22 16:41

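    Use the multibyte (Unicode-aware) string functions. strlen() and $word[$i] work on bytes, so they cut multibyte UTF-8 characters in half; mb_strlen() and mb_substr() work on characters: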

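    function Bigrams($word){
        $ngrams = array();
        // mb_strlen counts characters, not bytes
        $len = mb_strlen($word, 'UTF-8');
        for($i=0;$i+1<$len;$i++){
            // take 2 characters starting at character offset $i
            $ngrams[$i]=mb_substr($word, $i, 2, 'UTF-8');
        }
        return $ngrams;
    }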
    
    $word = "пожалуйста";
    
    print_r(Bigrams($word));
    

    Result:

    Array
    (
        [0] => по
        [1] => ож
        [2] => жа
        [3] => ал
        [4] => лу
        [5] => уй
        [6] => йс
        [7] => ст
        [8] => та
    )
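
As a possible generalization of the fix above, here is a minimal sketch that splits the string into characters once and then builds n-grams of any size. The helper name ngrams and its parameter $n are hypothetical and not part of the original answer. It assumes the input is UTF-8 and PHP 7 or later. It splits by code point, so a decomposed character (a base letter followed by a combining mark) still counts as two units.

```php
<?php
// Hypothetical helper (not from the original answer): n-grams of any size.
function ngrams(string $word, int $n = 2): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }
    // The /u flag makes PCRE treat the input as UTF-8, so the empty
    // pattern splits between code points rather than between bytes.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    $ngrams = [];
    $last = count($chars) - $n; // last valid start index
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

// "\u{00F2}" is the precomposed character ò (2 bytes in UTF-8).
$word = "L\u{00F2}ria";
// The byte count and the character count differ, which is why
// indexing with $word[$i] produced broken output in the question.
echo strlen($word), " bytes, ", mb_strlen($word, 'UTF-8'), " characters\n";
print_r(ngrams($word, 2));
```

Splitting once avoids calling mb_substr() in every loop iteration, which may matter for long strings because each call has to scan from the start of the string to find the character offset.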
    
    This answer was accepted by the asker as the best answer.
