dox19458 2012-07-25 22:12

Identifying (non-exact) duplicates in a MySQL database

Are there any tools for identifying and merging non-exact duplicates in MySQL tables?

I have a large data set with many duplicates like:

1348,  Auto Motors, 12 Long Road, etc
48264, Auto Mtors,  12 Log Road,  etc
82743, Ato Motoers, 12 Lng Road,  etc
83821, Auto Motors, 13 Long Road, etc
92743, Auto Motors, 11 Long Road, etc

Many tables need to be merged, such as:

  • Companies
  • Addresses
  • Phone Numbers
  • Employees

There are about 100,000 rows, with 30-40 columns to match on each row (across the joined tables).

Does anyone know of a tool for sorting this out? I already have MySQL and PHP installed. I have used MongoDB and Solr before and can use them again if they would help. I am also open to installing other software if needed.


Alternatively, what kind of queries should I run if I cannot find a tool to handle this?

A simple "find all duplicates" query won't work because the duplicates are not exact.

Wildcard LIKE searches would be extremely slow for all the different combinations I would need to try.

Using an Oliver or Levenshtein function (in MySQL) may work, but there is too much data to pull into PHP, which would probably also be extremely slow.
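To illustrate the Levenshtein idea without comparing every row against every other row, here is a minimal sketch in Python. It assumes the rows have already been reduced to (id, name, address) tuples. The names `levenshtein`, `candidate_pairs`, `SAMPLE_ROWS`, the first-letter blocking key, and the threshold of 3 are all hypothetical choices for illustration, not something from the question. Grouping rows into "blocks" first means edit distances are only computed within each block, which is what keeps the pair count manageable.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the question: (id, company name, address).
SAMPLE_ROWS = [
    (1348, "Auto Motors", "12 Long Road"),
    (48264, "Auto Mtors", "12 Log Road"),
    (82743, "Ato Motoers", "12 Lng Road"),
    (83821, "Auto Motors", "13 Long Road"),
    (92743, "Auto Motors", "11 Long Road"),
]


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def candidate_pairs(rows, max_distance=3):
    """Return (id_a, id_b, distance) for rows in the same block that look alike.

    Assumption: the blocking key is the first letter of the normalized name.
    A real data set would likely need a key that tolerates first-letter typos.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[normalize(row[1])[:1]].append(row)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            distance = (levenshtein(normalize(left[1]), normalize(right[1]))
                        + levenshtein(normalize(left[2]), normalize(right[2])))
            if distance <= max_distance:
                pairs.append((left[0], right[0], distance))
    return pairs
```

Calling `candidate_pairs(SAMPLE_ROWS)` yields pairs to review rather than a final merge list: a distance of 1 between "12 Long Road" and "13 Long Road" may be a typo or a genuinely different address, so the threshold needs tuning and the results probably need a manual check before merging.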

2 answers

  • duandaiqin6080 2012-07-25 22:44

    You have data that requires massaging. I don't think this is something you can do entirely in SQL.

    Google Refine is a great tool for massaging. I would load the data in Refine first, clean it up, then import into your relational database.
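As a rough idea of what the clean-up step can look like, the Python sketch below groups values by a normalized "fingerprint" key, similar in spirit to key-collision clustering. It is a simplified approximation under my own assumptions, not Refine's actual implementation, and the names `fingerprint` and `cluster_by_fingerprint` are hypothetical. It catches differences in case, punctuation, spacing, and word order, but not typos such as "Auto Mtors", which still need an edit-distance style comparison.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: ASCII-only, lowercase, no punctuation, sorted unique tokens."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (unicodedata.normalize("NFKD", value)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Remove punctuation and lowercase; split() also trims and collapses whitespace.
    cleaned = ascii_text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values that share a fingerprint; return only groups with more than one member."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]
```

For example, `cluster_by_fingerprint(["Auto Motors", "MOTORS, Auto.", "Auto Mtors"])` groups the first two values together and leaves the misspelled one out.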

    This answer was selected by the asker as the best answer.