Please consider the following "tweets" table:
tweet_id  user_id  text
----------------------------
1         1        lorem ipsum
2         1        lorem ipsum
3         2        pear
4         1        dolor
5         3        foo
6         1        dolor
7         1        dolor
8         3        bar
9         3        baz
10        4        happy
11        4        happy
12        2        apple
13        3        foo
14        4        happy
In reality, the table contains millions of tweets from about 80,000 users. Many of these users are spam accounts, but they are hard to identify by hand. As a rule of thumb, a spam account posts the same message at least 3 times. That's why I want to fill the following two tables, "duplicates" on the left and "duplicates_tweets" on the right:
duplicate_id  user_id  text         cnt    duplicate_id  tweet_id
---------------------------------------    ----------------------
1             1        lorem ipsum  2      1             1
2             1        dolor        3      1             2
3             2        pear         1      2             4
4             2        apple        1      2             6
5             3        foo          2      2             7
6             3        bar          1      3             3
7             3        baz          1      4             12
8             4        happy        3      5             5
                                           5             13
                                           6             8
                                           7             9
                                           8             10
                                           8             11
                                           8             14
With these tables I could very easily sort on cnt, for instance, and see which users post the most duplicate messages. My question, however, is how to fill them most efficiently. In other words: what query would fill these tables with the least work? And is it possible with just SQL, or should I use PHP as an intermediary, for instance a script that takes a tweet from the "tweets" table, scans for its duplicates, fills the tables, and moves on to the next tweet? I'm afraid that would take ages to finish, so any help is greatly appreciated!
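To make the SQL-only idea concrete, here is a minimal sketch of the set-based approach I have in mind: one GROUP BY to fill "duplicates", then one join back to "tweets" to fill "duplicates_tweets". It is wrapped in Python's sqlite3 module purely so that it runs as-is on the sample data above. The column types, the AUTOINCREMENT key and the use of SQLite are my assumptions, not the real setup, and I have not measured how this behaves on millions of rows.

```python
import sqlite3

# In-memory SQLite database; the schema below is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
CREATE TABLE duplicates (
    duplicate_id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER,
    text TEXT,
    cnt INTEGER
);
CREATE TABLE duplicates_tweets (duplicate_id INTEGER, tweet_id INTEGER);
""")

# The 14 sample rows from the "tweets" table above.
sample = [
    (1, 1, "lorem ipsum"), (2, 1, "lorem ipsum"), (3, 2, "pear"),
    (4, 1, "dolor"), (5, 3, "foo"), (6, 1, "dolor"), (7, 1, "dolor"),
    (8, 3, "bar"), (9, 3, "baz"), (10, 4, "happy"), (11, 4, "happy"),
    (12, 2, "apple"), (13, 3, "foo"), (14, 4, "happy"),
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", sample)

conn.executescript("""
-- Step 1: one row per distinct (user_id, text) pair, with its count.
INSERT INTO duplicates (user_id, text, cnt)
SELECT user_id, text, COUNT(*)
FROM tweets
GROUP BY user_id, text
ORDER BY user_id, MIN(tweet_id);

-- Step 2: link every tweet to the group it belongs to.
INSERT INTO duplicates_tweets (duplicate_id, tweet_id)
SELECT d.duplicate_id, t.tweet_id
FROM tweets AS t
JOIN duplicates AS d
  ON d.user_id = t.user_id AND d.text = t.text;
""")

# Rule of thumb from above: the same message posted at least 3 times.
spam_users = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_id FROM duplicates WHERE cnt >= 3 ORDER BY user_id"
    )
]
print(spam_users)  # [1, 4] on the sample data
```

If something like this is viable, I assume the join in step 2 would need an index covering (user_id, text) on one of the two tables to be fast on the real data, but that is a guess on my part.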