douqian3712 2009-08-01 02:56
浏览 35
已采纳

在PHP中加速levenshtein / similar_text

I am currently using similar_text to compare a string against a list of ~50,000 which works although due to the number of comparisons it's very slow. It takes around 11 minutes to compare ~500 unique strings.

Before running this I do check the databases to see whether it has been processed in the past so everytime after the inital run it's close to instant.

I'm sure using levenshtein would be slightly faster and the LevenshteinDistance function someone posted in the manual looks interesting. Am I missing something that could make this significantly faster?

  • 写回答

3条回答 默认 最新

  • dpo60833 2009-08-05 05:24
    关注

    In the end, both levenshtein and similar_text were both too slow with the number of strings it had to go through, even with lots of checks and only using them one of them as a last resort.

    As an experiment, I ported some of the code to C# to see how much faster it would be over interperated code. It ran in about 3 minutes with the same dataset.

    Next I added an extra field to the table and used the double metaphone PECL extension to generate keys for each row. The results were good although since some included numbers this caused duplicates. I guess I could then have run each one through the above functions but decided not to.

    In the end I opted for the simplest approach, MySQLs full text which worked very well. Occasionally there are mistakes although they are easy to detect and correct. Also it runs very fast, in around 3-4 seconds.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥20 C++通过HICON获取argb像素数组
  • ¥15 如何利用支持向量机提高分类器正确率和筛选理想分类器
  • ¥15 Pygame坦克大战游戏开发实验报告
  • ¥15 用vmmare虚拟机用sentaurus仿真的时候,调用terminal程序,输入swb指令弹出这个,打不开workbench,桌面上面的sentaurus workbench也打不开
  • ¥75 使用winspool.drv的SetPrinter设置打印机失败
  • ¥15 simulink 硬件在环路仿真
  • ¥15 python动态规划:N根火柴摆出的最大数字
  • ¥20 (标签-excel)
  • ¥200 求idea10和MyEclipse7.1
  • ¥20 vb6.0截取当前窗体保存为jpg文件