How do I find duplicates in a database?

There are many questions on how to find duplicates in a database, but not with the specific problem that I have.

I have a table with approximately 120,000 entries, and I need to find the duplicates among them. To do that, I use a PHP script structured like this:

//get all entries from database
//loop through them
    //get entries with greater id
    //compare all of them with the original one
    //update database (delete duplicate, update information in linked tables, etc.)

I can't filter out all duplicates in the initial query, because my duplicate search has to catch not only entries that are 100% alike but also entries that are 90% alike, so I have to loop through all entries. I use similar_text() for that.
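
For reference, a minimal sketch of the kind of 90% check described above. The helper name isNearDuplicate() and the sample strings are hypothetical; only similar_text() and the 90% threshold come from the question.

```php
<?php
// Hypothetical helper: true when two strings are at least $threshold percent
// alike according to similar_text(). The 90.0 default mirrors the question.
function isNearDuplicate(string $a, string $b, float $threshold = 90.0): bool
{
    // similar_text() writes the similarity percentage into its third argument.
    similar_text($a, $b, $percent);
    return $percent >= $threshold;
}

var_dump(isNearDuplicate('example.com', 'example.com')); // bool(true)
var_dump(isNearDuplicate('example.com', 'unrelated'));   // bool(false)
```

Note that similar_text() is case-sensitive, so inputs may need to be normalised first if case should not count as a difference.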

I think the first loop is okay, but looping through all other entries within the loop is just too much. With 120000 entries this would be close to (120000^2)/2 iterations.

So instead of using a loop within the loop, there must be a better way to do it. Do you have any ideas? I thought about using in_array(), but it is not sensitive to something like 90% string similarity, and also doesn't give me the array's fields it found the duplicates in - I would need those to get the entries' ids to update the database correctly.
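
One possible way to cut down the inner loop, sketched under a strong assumption: near-duplicates are assumed to share a cheap key (here, the lowercased first three characters), so similar_text() only runs inside each bucket. The function name findNearDuplicates(), the prefix key, and the sample data are all hypothetical, and pairs that differ within the prefix will be missed.

```php
<?php
// Hypothetical helper: bucket rows by a cheap key, then compare only within
// each bucket. $rows maps id => string; returns a list of [id, id] pairs.
function findNearDuplicates(array $rows, float $threshold = 90.0, int $prefixLen = 3): array
{
    // Assumption: near-duplicates share their first $prefixLen characters.
    $buckets = [];
    foreach ($rows as $id => $text) {
        $key = strtolower(substr($text, 0, $prefixLen));
        $buckets[$key][$id] = $text;
    }

    $pairs = [];
    foreach ($buckets as $bucket) {
        $ids = array_keys($bucket);
        $count = count($ids);
        for ($i = 0; $i < $count; $i++) {
            for ($j = $i + 1; $j < $count; $j++) {
                similar_text($bucket[$ids[$i]], $bucket[$ids[$j]], $percent);
                if ($percent >= $threshold) {
                    // Keep both ids so linked tables can be updated later.
                    $pairs[] = [$ids[$i], $ids[$j]];
                }
            }
        }
    }
    return $pairs;
}

$rows = [1 => 'example.com', 2 => 'example.con', 3 => 'other.org'];
print_r(findNearDuplicates($rows));
```

If the entries spread evenly over B buckets, the number of comparisons drops from roughly n^2/2 to roughly n^2/(2B). Unlike in_array(), the result carries the ids of both entries in each pair.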

Any ideas?

Thank you very much!

Charles

UPDATE 1

The query I am using right now is the following:

SELECT a.host_id
FROM host_webs a
JOIN host_webs b ON a.host_id != b.host_id AND a.web = b.web
GROUP BY a.host_id

It shows originals and duplicates perfectly, but I need to get rid of the originals, i.e. the first ones found with the associated data. How can I accomplish that?

duanao6704 2012-07-12 22:34
You can JOIN the table onto itself and do it all in SQL (I know you say you don't think you can, but I would be surprised if this is the case). All you need to do is put all the columns you use to test for duplicates into the ON clause of the JOIN.

SELECT a.id FROM tablename a JOIN tablename b ON a.id != b.id AND a.col1 = b.col1 AND a.col2 = b.col2 GROUP BY a.id

This will return just the ids of the rows where col1 and col2 are duplicated. You can incorporate whatever string comparisons you need into this; the ON clause can be as complicated as you need it to be. For example:

SELECT a.id
FROM tablename a
JOIN tablename b ON a.id != b.id AND (
    (a.col1 = b.col1 AND (a.col2 = b.col2 OR a.col3 = b.col3))
    OR ((a.col1 = b.col1 OR a.col2 = b.col2) AND a.col3 = b.col3)
    OR (SOUNDEX(a.col1) = SOUNDEX(b.col1) AND SOUNDEX(a.col2) = SOUNDEX(b.col2) AND SOUNDEX(a.col3) = SOUNDEX(b.col3))
)
GROUP BY a.id

EDIT

Since all you are actually doing with your query is looking for rows where the web column is identical, this would do the job of finding only the duplicates and not the original "good" records - assuming host_id is numeric and that the "good" record would be the one with the lowest host_id:

SELECT b.host_id FROM host_webs a INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id GROUP BY b.host_id
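
To see what that query returns, here is a small sketch that runs it from PHP. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database as a stand-in for the real one; the sample rows are invented, and the ORDER BY is added only to make the output order predictable.

```php
<?php
// In-memory SQLite stands in for the real database (assumption).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Invented sample data: 'a.example' appears three times, 'b.example' once.
$pdo->exec('CREATE TABLE host_webs (host_id INTEGER, web TEXT)');
$pdo->exec("INSERT INTO host_webs (host_id, web) VALUES
    (1, 'a.example'), (2, 'a.example'), (3, 'b.example'), (4, 'a.example')");

// Same join as above: b is a duplicate if some a has the same web
// and a lower host_id.
$sql = 'SELECT b.host_id
        FROM host_webs a
        INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id
        GROUP BY b.host_id
        ORDER BY b.host_id';

$duplicateIds = array_map('intval', $pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN));
print_r($duplicateIds); // host_ids 2 and 4; the originals 1 and 3 are left out
```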

I imagine the end game here would be to remove the duplicates, so if you are feeling brave you could actually delete them in one go:

DELETE b.* FROM host_webs a INNER JOIN host_webs b ON b.web = a.web AND b.host_id > a.host_id

The GROUP BY is not necessary in the DELETE statement because it doesn't matter if you try and delete the same row more than once in a single statement.
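
If you are feeling less brave, the delete can be wrapped in a transaction and checked before committing. This is only a sketch under assumptions: it needs the pdo_sqlite extension and runs against an in-memory SQLite stand-in with invented rows. Because SQLite does not accept a DELETE with a JOIN, it swaps in an EXISTS subquery that expresses the same condition; MySQL generally rejects a subquery on the table being deleted from, so there the join form above is the one to use.

```php
<?php
// In-memory SQLite stands in for the real database (assumption).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE host_webs (host_id INTEGER, web TEXT)');
$pdo->exec("INSERT INTO host_webs (host_id, web) VALUES
    (1, 'a.example'), (2, 'a.example'), (3, 'b.example'), (4, 'a.example')");

// A row is a duplicate if another row has the same web and a lower host_id.
$isDuplicate = 'EXISTS (SELECT 1 FROM host_webs a
                        WHERE a.web = host_webs.web
                          AND a.host_id < host_webs.host_id)';

// Count first, so the delete can be verified before it becomes permanent.
$expected = (int) $pdo->query("SELECT COUNT(*) FROM host_webs WHERE $isDuplicate")->fetchColumn();

$pdo->beginTransaction();
$deleted = $pdo->exec("DELETE FROM host_webs WHERE $isDuplicate");
if ($deleted === $expected) {
    $pdo->commit();
} else {
    $pdo->rollBack(); // Something changed in between; leave the table untouched.
}

$remaining = array_map('intval', $pdo->query(
    'SELECT host_id FROM host_webs ORDER BY host_id'
)->fetchAll(PDO::FETCH_COLUMN));
print_r($remaining);
```

The same count-then-delete pattern should carry over to the join-based DELETE on the real database, with any updates to linked tables done inside the same transaction.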
This answer was selected by the asker as the best answer.
