douting0585 2015-06-09 19:12

PHP: split a string into known tokens and add the remaining words to the array as single words [closed]

I'm looking for a smart, very lightweight, and creative way to convert a title string into a tokenized object while keeping known, predefined multi-word dictionary entries (such as two-word phrases) intact.

For example, the dictionary contains over 300 words and word sets such as: sheet set, jacket, suit, oxford shoes

The string may contain something like: 4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory

I would like the resulting array to be stripped of all noisy words (i.e. remove any words that contain numbers or are not long enough).

So first I run a regex and strip everything that is not a-zA-Z and at least {2,} characters long.

Then I want to receive the following array:

  • cotton
  • queen
  • sheet set
  • ivory

Here "sheet set" remains a single token because it is contained in our dictionary.

I'm looking for a solution that runs very fast: there are thousands of parallel processes, I'm trying to save as many iterations as possible, and the dictionary keeps growing as well.
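
For illustration, the noise-filtering step described above might look roughly like the sketch below. The function name `filterNoise` and the `$minLen` parameter are hypothetical, not taken from any existing code. It also assumes a minimum length of 3 rather than the 2 mentioned in the question, because the expected output drops "in"; adjust that threshold as needed.

```php
<?php
// Hypothetical helper: keep only purely alphabetic words of a minimum length.
// Assumption: a "word" is any whitespace-separated chunk of the title.
function filterNoise(string $title, int $minLen = 3): array
{
    $words = preg_split('/\s+/', strtolower(trim($title)), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($words as $word) {
        // ctype_alpha() rejects chunks containing digits, hyphens, or symbols,
        // so "4-piece", "1000tc", and "100%" are all dropped.
        if (strlen($word) >= $minLen && ctype_alpha($word)) {
            $kept[] = $word;
        }
    }
    return $kept;
}

// Example call with the title from the question.
$words = filterNoise('4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory');
```

At this point "sheet" and "set" are still separate words; merging them into a single dictionary token is a second step.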

2 answers

  • doucheng7534 2015-06-09 19:20

    If you need something really fast, you might consider building a tree-based structure from your dictionary (each character is linked down to the next one); then, at each space, you try to descend the tree.

    See http://en.wikipedia.org/wiki/Trie

    However, if speed is a primary concern, you should avoid PHP.
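
As a rough sketch of the trie idea in this answer, the code below uses a word-level variation: each node is keyed by a whole word rather than by a single character, which keeps the example short. The names `buildTrie` and `tokenize` and the `'$'` end-of-entry marker are assumptions made for illustration. The input is assumed to be an already filtered, lowercased list of alphabetic words, and the tokenizer greedily takes the longest dictionary entry starting at each position.

```php
<?php
// Build a word-level trie: each node maps a word to its child node.
// The '$' key marks the end of a dictionary entry (words are assumed
// to be alphabetic, so '$' cannot collide with a real word).
function buildTrie(array $phrases): array
{
    $root = [];
    foreach ($phrases as $phrase) {
        $node = &$root;
        foreach (explode(' ', strtolower($phrase)) as $word) {
            if (!isset($node[$word])) {
                $node[$word] = [];
            }
            $node = &$node[$word];
        }
        $node['$'] = true;
        unset($node); // drop the reference before handling the next phrase
    }
    return $root;
}

// Greedy longest match: at each position, walk down the trie as far as
// the following words allow and remember the last complete entry seen.
function tokenize(array $words, array $trie): array
{
    $tokens = [];
    $count = count($words);
    $i = 0;
    while ($i < $count) {
        $node = $trie;
        $matchEnd = -1; // index of the last word of the longest match
        for ($j = $i; $j < $count && isset($node[$words[$j]]); $j++) {
            $node = $node[$words[$j]];
            if (isset($node['$'])) {
                $matchEnd = $j;
            }
        }
        if ($matchEnd >= $i) {
            $tokens[] = implode(' ', array_slice($words, $i, $matchEnd - $i + 1));
            $i = $matchEnd + 1;
        } else {
            $tokens[] = $words[$i]; // not a dictionary entry: keep as a single word
            $i++;
        }
    }
    return $tokens;
}

$trie = buildTrie(['sheet set', 'jacket', 'suit', 'oxford shoes']);
$tokens = tokenize(['cotton', 'queen', 'sheet', 'set', 'ivory'], $trie);
// $tokens is ['cotton', 'queen', 'sheet set', 'ivory']
```

Because the trie depends only on the dictionary, it could be built once and reused across titles, so each title costs roughly one pass over its words regardless of how large the dictionary grows.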
