douxiawei9318 2012-07-12 22:26
浏览 32
已采纳

如何在数据库中查找重复项?

There are many questions on how to find duplicates in a database, but not with the specific problem that I have.

I have a table with approx. 120000 entries. I need to find duplicates. To find them, I use a php script that is structured like the following:

//get all entries from database
//loop through them
    //get entries with greater id
    //compare all of them with the original one
    //update database (delete duplicate, update information in linked tables, etc.)

It is not possible to sort out all duplicates already in the initial query, because I have to loop through all entries since my duplicate search is sensitive not only to entries that are 100% alike, but also entries that are 90% alike. I use similar_text() for that.

I think the first loop is okay, but looping through all other entries within the loop is just too much. With 120000 entries this would be close to (120000^2)/2 iterations.

So instead of using a loop within the loop, there must be a better way to do it. Do you have any ideas? I thought about using in_array(), but it is not sensitive to something like 90% string similarity, and also doesn't give me the array's fields it found the duplicates in - I would need those to get the entries' ids to update the database correctly.

Any ideas?

Thank you very much!

Charles

UPDATE 1

The query I am using right now is the following:

SELECT a.host_id
FROM host_webs a
JOIN host_webs b ON a.host_id != b.host_id AND a.web = b.web
GROUP BY a.host_id

It shows originals and duplicates perfectly, but I need to get rid of the originals, i.e. the first ones found with the associated data. How can I accomplish that?

  • 写回答

2条回答 默认 最新

  • duanao6704 2012-07-12 22:34
    关注

    You can JOIN the table onto itself and do it all in SQL (I know you say you don't think you can, but I would be surprised if this is the case). All you need to do is put all the columns you use to test for duplicates into the ON clause of the JOIN.

    SELECT id
    FROM tablename a
    JOIN tablename b ON a.id != b.id AND a.col1 = b.col1 AND a.col2 = b.col2
    GROUP BY id
    

    This will return just the ids of the rows where col1 and col2 are duplicated. You can incorporate whatever string comparisons you need into this, the ON clause can be as complicated as you need it to be. For example:

    SELECT id
    FROM tablename a
    JOIN tablename b ON a.id != b.id AND
      (a.col1 = b.col1 AND (a.col2 = b.col2 OR a.col3 = b.col3))
      OR ((a.col1 = b.col1 OR a.col2 = b.col2) AND a.col3 = b.col3)
      OR (SOUNDEX(a.col1) = SOUNDEX(b.col1) AND SOUNDEX(a.col2) = SOUNDEX(b.col2) AND SOUNDEX(a.col3) = SOUNDEX(b.col3))
    GROUP BY id
    

    EDIT

    Since all you are actually doing with your query is looking for rows where the web column is identical, this would do the job of finding only the duplicates and not the original "good" records - assuming host_id is numeric and that the "good" record would be the one with the lowest host_id:

    SELECT b.host_id
    FROM host_webs a
    INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
    GROUP BY b.host_id
    

    I imagine the end game here would be to remove the duplicates, so if you are feeling brave you could actually delete them in one go:

    DELETE b.*
    FROM host_webs a
    INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
    

    The GROUP BY is not necessary in the DELETE statement because it doesn't matter if you try and delete the same row more than once in a single statement.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 基于卷积神经网络的声纹识别
  • ¥15 Python中的request,如何使用ssr节点,通过代理requests网页。本人在泰国,需要用大陆ip才能玩网页游戏,合法合规。
  • ¥100 为什么这个恒流源电路不能恒流?
  • ¥15 有偿求跨组件数据流路径图
  • ¥15 写一个方法checkPerson,入参实体类Person,出参布尔值
  • ¥15 我想咨询一下路面纹理三维点云数据处理的一些问题,上传的坐标文件里是怎么对无序点进行编号的,以及xy坐标在处理的时候是进行整体模型分片处理的吗
  • ¥15 CSAPPattacklab
  • ¥15 一直显示正在等待HID—ISP
  • ¥15 Python turtle 画图
  • ¥15 stm32开发clion时遇到的编译问题