dox19458 2012-07-25 22:12

Identifying (non-exact) duplicates in a MySQL database

Are there any tools for identifying and merging non-exact duplicates in MySQL tables?

I have a large data set with many duplicates like:

1348,  Auto Motors, 12 Long Road, etc
48264, Auto Mtors,  12 Log Road,  etc
82743, Ato Motoers, 12 Lng Road,  etc
83821, Auto Motors, 13 Long Road, etc
92743, Auto Motors, 11 Long Road, etc

Many tables need to be merged, such as:

  • Companies
  • Addresses
  • Phone Numbers
  • Employees

There are about 100,000 rows, with 30-40 columns to match on each row (across the joined tables).

Does anyone know of a tool for sorting this out? I already have MySQL and PHP installed. I have used MongoDB and Solr before and can use them again if they would help. I am also open to installing other software if needed.


Alternatively, what kind of queries should I run if I cannot find a tool to handle this?

A simple find-all-duplicates query won't work because the duplicates are not exact.

Wildcard LIKE searches would be extremely slow given all the different combinations I would need to try.

Using an Oliver or Levenshtein function in MySQL may work, but there is too much data to pull into PHP (and doing so would probably also be extremely slow).
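One way to keep an edit-distance approach tractable is to avoid comparing every row with every other row. The sketch below (in Python, not PHP or SQL) groups rows by a cheap blocking key and runs Levenshtein only within each group. The blocking key (the first token of the address, assumed to be a house number), the threshold of 3 edits, and the function names are all illustrative assumptions, not part of any existing tool.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def blocking_key(address: str) -> str:
    """Hypothetical blocking key: the first token of the address."""
    parts = address.split()
    return parts[0] if parts else ""


def candidate_duplicates(rows, max_distance=3):
    """Return id pairs whose name+address differ by at most max_distance edits.

    rows is an iterable of (id, name, address) tuples. Only rows that share
    a blocking key are compared, which avoids the full all-pairs scan.
    """
    blocks = defaultdict(list)
    for row_id, name, address in rows:
        blocks[blocking_key(address)].append((row_id, f"{name} {address}".lower()))

    pairs = []
    for members in blocks.values():
        for (id_a, text_a), (id_b, text_b) in combinations(members, 2):
            if levenshtein(text_a, text_b) <= max_distance:
                pairs.append((id_a, id_b))
    return pairs


# Sample rows taken from the question.
rows = [
    (1348, "Auto Motors", "12 Long Road"),
    (48264, "Auto Mtors", "12 Log Road"),
    (82743, "Ato Motoers", "12 Lng Road"),
    (83821, "Auto Motors", "13 Long Road"),
    (92743, "Auto Motors", "11 Long Road"),
]
print(candidate_duplicates(rows))
```

With this key, the rows at "13 Long Road" and "11 Long Road" are never compared with the "12 Long Road" rows. Whether those are typos or genuinely different addresses is a judgment the blocking key makes for you, so it needs to be chosen with the data in mind.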


2 answers

  • duandaiqin6080 2012-07-25 22:44

    You have data that requires massaging. I don't think this is something you can do entirely in SQL.

    Google Refine is a great tool for this kind of massaging. I would load the data into Refine first, clean it up, and then import it into your relational database.
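Refine's clustering feature includes a key-collision method, which groups values that collapse to the same normalized key. The Python sketch below imitates that idea in simplified form. It is not Refine's actual implementation, and `fingerprint` and `cluster` are hypothetical helper names chosen for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a string: lowercase, strip ASCII punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Illustrative values, not taken from a real data set.
names = ["Auto Motors", "auto motors.", "Motors, Auto", "  AUTO   MOTORS ", "Auto Mtors"]
for group in cluster(names):
    print(group)
```

Key collision catches differences in case, punctuation, spacing, and word order. It does not catch misspellings such as "Auto Mtors", which is why a distance-based comparison is usually needed as a second pass.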

    Selected as the best answer by the asker.