douliao5467 2011-01-20 01:56
Views: 32
Accepted

Extracting a word from text, taking punctuation into account

Given an offset marking the start of a word, I need a way to get the length of that word, treating punctuation marks as word boundaries.

Example:

$str = "my text bla bla-bla; hello! abc";
$offset = 21;  // start of "hello" (0-based)

Now I need a function that returns 5, since "hello" is 5 characters long.

These are some of the punctuation characters that may occur:

array(',','.',' ','-',"'",'"',';',':','?','!','|','/','\\','<','>')

I could do some manual parsing, but I would like to write something more elegant.

2 answers

  • dqh1992 2011-01-20 03:00

    This should help you:

    function getWordSize($string, $offset = 0)
    {
        $word = array();
    
        if (preg_match('~.{' . max(0, intval($offset)) . '}(\p{L}+)~u', $string, $word) > 0)
        {
            if (array_key_exists(1, $word) === true)
            {
                // Length in bytes. For the length in characters, use
                // mb_strlen($word[1], 'UTF-8') instead (utf8_decode() is
                // deprecated as of PHP 8.2).
                return strlen($word[1]);
            }
        }
    
        return 0;
    }
    

    Usage:

    echo getWordSize('my text bla bla-bla; hello! abc', 21); // 5
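
As an aside to the regex approach above: if the word boundaries are exactly the characters listed in the question, PHP's built-in strcspn() can do the counting directly. The following is only a minimal sketch under that assumption; the function name wordLengthAt is hypothetical, and the result is a byte count, so it equals the character count only for single-byte text.

```php
<?php
// Hypothetical helper (not part of the original answer): counts bytes from
// $offset up to the first delimiter, using the delimiter list from the question.
function wordLengthAt(string $str, int $offset): int
{
    // , . space - ' " ; : ? ! | / \ < >
    $delimiters = ",. -'\";:?!|/\\<>";

    // strcspn() rejects out-of-range offsets on PHP 8, so guard first.
    if ($offset < 0 || $offset >= strlen($str)) {
        return 0;
    }

    // Length of the initial segment (starting at $offset) that contains
    // none of the delimiter characters.
    return strcspn($str, $delimiters, $offset);
}

echo wordLengthAt('my text bla bla-bla; hello! abc', 21), "\n"; // 5
```

Unlike \p{L}+, this treats digits and any character not in the list as part of the word, so the two approaches can disagree on input such as "abc123".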
    

    However, getWordSize() does not handle offsets that fall in the middle of a word; it only measures the rest of that word:

    echo getWordSize('my text bla bla-bla; hello! abc', 23); // 3
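
If mid-word offsets need to resolve to the whole word, one possible approach is to step backwards to the word's first character before measuring. This is a rough sketch, not the original author's method: the helper name wordAround is hypothetical, the delimiter list is taken from the question, and positions and lengths are in bytes.

```php
<?php
// Hypothetical helper: returns [start, length] of the word containing $offset,
// or [$offset, 0] if $offset is out of range or points at a delimiter.
function wordAround(string $str, int $offset): array
{
    // Delimiters from the question: , . space - ' " ; : ? ! | / \ < >
    $delimiters = ",. -'\";:?!|/\\<>";

    if ($offset < 0 || $offset >= strlen($str)
        || strpos($delimiters, $str[$offset]) !== false) {
        return [$offset, 0];
    }

    // Walk left until the previous byte is a delimiter or the string starts.
    $start = $offset;
    while ($start > 0 && strpos($delimiters, $str[$start - 1]) === false) {
        $start--;
    }

    // Measure from the word start up to the next delimiter.
    return [$start, strcspn($str, $delimiters, $start)];
}

[$start, $length] = wordAround('my text bla bla-bla; hello! abc', 23);
echo $start, ' ', $length, "\n"; // 21 5
```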
    