dougaojue8185 2009-12-24 11:30

Comparing 5000 strings with PHP Levenshtein

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

8 answers

  • duanpie4763 2009-12-24 12:04

    I think a better way to group similar addresses would be to:

    1. create a database with two tables - one for the address (and an id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)

    2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

    3. split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

    4. for each address (with id $target) find the most similar addresses:

      SELECT similar.id, similar.address, COUNT(*) AS matches
      FROM address similar, word cmp, word src
      WHERE src.address_id = $target
      AND src.soundex = cmp.soundex
      AND cmp.address_id = similar.id
      AND similar.id <> $target
      GROUP BY similar.id, similar.address
      ORDER BY matches DESC
      LIMIT $some_value;
      
    5. calculate the Levenshtein distance between your source address and the top few values returned by the query.
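
Steps 2 and 3 above could be sketched in PHP roughly as follows. The function name `address_keys` is hypothetical and not part of the original answer; it only shows how one address might be turned into the list of keys that would be stored in the soundexes table, using the built-in `strtoupper()`, `preg_replace()`, `preg_split()`, `ctype_digit()` and `soundex()`.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// turn one address into the keys described in steps 2 and 3.
function address_keys(string $address): array
{
    // Step 2: uppercase, then replace anything other than A-Z or 0-9 with a space.
    $clean = preg_replace('/[^A-Z0-9]+/', ' ', strtoupper($address));

    $keys = [];
    // Step 3: split on spaces; keep all-digit tokens as they are,
    // and store the soundex of every other token.
    foreach (preg_split('/ +/', trim($clean), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $keys[] = ctype_digit($token) ? $token : soundex($token);
    }
    return $keys;
}

// Example: address_keys('123 Main Street') gives ['123', 'M500', 'S363'].
```

Each returned key would become one row in the soundexes table, next to the id of the address it came from. Because "Main" and "Mane" share the soundex M500, the query in step 4 would count them as a matching word.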

    (doing any sort of operation on large arrays is often faster in databases)
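
Step 5 might look something like the sketch below. It assumes the candidates returned by the query have already been fetched into a PHP array of `id => address`; the function name `rank_by_levenshtein` and that array shape are assumptions made for illustration. Both strings are normalized the same way as in step 2 before calling the built-in `levenshtein()`, which before PHP 8.0 returned -1 for strings longer than 255 characters.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// order candidate addresses by Levenshtein distance to the source address.
function rank_by_levenshtein(string $source, array $candidates): array
{
    // Same normalization as step 2, so punctuation and case do not add distance.
    $normalize = static function (string $s): string {
        return trim(preg_replace('/[^A-Z0-9]+/', ' ', strtoupper($s)));
    };

    $src = $normalize($source);
    $ranked = [];
    foreach ($candidates as $id => $address) {
        $ranked[$id] = levenshtein($src, $normalize($address));
    }

    asort($ranked); // smallest distance first, ids preserved as keys
    return $ranked;
}
```

Since only the top few rows per address reach this function, the number of `levenshtein()` calls grows with the number of addresses times `$some_value` rather than with every possible pair.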

    This answer was selected by the asker as the best answer.