There are many questions on how to find duplicates in a database, but not with the specific problem that I have.
I have a table with approx. 120000 entries. I need to find duplicates. To find them, I use a php script that is structured like the following:
//get all entries from database
//loop through them
//get entries with greater id
//compare all of them with the original one
//update database (delete duplicate, update information in linked tables, etc.)
It is not possible to sort out all duplicates already in the initial query, because I have to loop through all entries since my duplicate search is sensitive not only to entries that are 100% alike, but also entries that are 90% alike. I use similar_text() for that.
I think the first loop is okay, but looping through all other entries within the loop is just too much. With 120000 entries this would be close to (120000^2)/2 iterations.
So instead of using a loop within the loop, there must be a better way to do it. Do you have any ideas? I thought about using in_array(), but it is not sensitive to something like 90% string similarity, and also doesn't give me the array's fields it found the duplicates in - I would need those to get the entries' ids to update the database correctly.
Any ideas?
Thank you very much!
Charles
UPDATE 1
The query I am using right now is the following:
SELECT a.host_id
FROM host_webs a
JOIN host_webs b ON a.host_id != b.host_id AND a.web = b.web
GROUP BY a.host_id
It shows originals and duplicates perfectly, but I need to get rid of the originals, i.e. the first ones found with the associated data. How can I accomplish that?