douwen3127 2012-03-31 18:52
Views: 74
Answer accepted

How can I efficiently find duplicate tweets in my database and keep a log of them?

Please consider the following "tweets" table:

tweet_id user_id text      
----------------------------
 1       1       lorem ipsum
 2       1       lorem ipsum
 3       2       pear
 4       1       dolor
 5       3       foo
 6       1       dolor
 7       1       dolor
 8       3       bar
 9       3       baz
10       4       happy
11       4       happy
12       2       apple
13       3       foo
14       4       happy

In reality, the table contains millions of tweets from about 80,000 users. Many of these users are spam accounts, but they are hard to identify by hand. As a rule of thumb, spam accounts post the same message at least 3 times. That's why I want to fill the following tables, "duplicates" on the left and "duplicates_tweets" on the right:

duplicate_id user_id  text         cnt       duplicate_id  tweet_id
--------------------------------------       ----------------------
1            1        lorem ipsum  2         1              1
2            1        dolor        3         1              2
3            2        pear         1         2              4
4            2        apple        1         2              6
5            3        foo          2         2              7
6            3        bar          1         3              3
7            3        baz          1         4             12
8            4        happy        3         5              5
                                             5             13
                                             6              8
                                             7              9
                                             8             10
                                             8             11
                                             8             14

I can now very easily sort on cnt, for instance, and see which users post the most duplicate messages. My question, however, is how to do this most efficiently. In other words: what query would fill these tables most efficiently? And is it possible with SQL alone, or should I use PHP as an intermediary, for instance a script that takes a tweet from the "tweets" table, scans for duplicates, fills the tables, and moves on to the next tweet? I'm afraid that would take ages to finish, so any help is greatly appreciated!


5 answers

  • douweida2669 2012-03-31 19:00

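    You can use the REPLACE statement in MySQL. It inserts a new row or, if a row with the same PRIMARY KEY or UNIQUE key already exists, deletes that row and inserts the new one. This assumes a unique key on (user_id, text) in "duplicates":

    REPLACE INTO duplicates (user_id, text, cnt)
    SELECT user_id, text, cnt
    FROM (SELECT user_id, text, COUNT(*) AS cnt
          FROM tweets
          GROUP BY user_id, text
          -- keep only messages a user posted at least 3 times
          HAVING COUNT(*) > 2) AS repeated;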
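
The query above only fills "duplicates". A minimal sketch of filling both tables with set-based SQL, rather than looping over tweets one at a time, might look like the following. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The function names `build_duplicate_log` and `spam_suspects`, the column types, and the UNIQUE key are assumptions for illustration and are not from the original post. Unlike the HAVING filter above, it logs every (user_id, text) group, as in the question's example tables, and applies the "at least 3 times" rule when querying.

```python
import sqlite3

# Sample rows from the question's "tweets" table: (tweet_id, user_id, text).
TWEETS = [
    (1, 1, "lorem ipsum"), (2, 1, "lorem ipsum"), (3, 2, "pear"),
    (4, 1, "dolor"), (5, 3, "foo"), (6, 1, "dolor"), (7, 1, "dolor"),
    (8, 3, "bar"), (9, 3, "baz"), (10, 4, "happy"), (11, 4, "happy"),
    (12, 2, "apple"), (13, 3, "foo"), (14, 4, "happy"),
]


def build_duplicate_log(conn):
    """Fill duplicates and duplicates_tweets with two set-based statements."""
    conn.executescript("""
        -- Assumed schema; the original post does not give column types.
        CREATE TABLE duplicates (
            duplicate_id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id      INTEGER NOT NULL,
            text         TEXT    NOT NULL,
            cnt          INTEGER NOT NULL,
            UNIQUE (user_id, text)
        );
        CREATE TABLE duplicates_tweets (
            duplicate_id INTEGER NOT NULL,
            tweet_id     INTEGER NOT NULL PRIMARY KEY
        );

        -- One row per distinct (user_id, text) pair with its tweet count.
        INSERT INTO duplicates (user_id, text, cnt)
        SELECT user_id, text, COUNT(*)
        FROM tweets
        GROUP BY user_id, text
        ORDER BY user_id, MIN(tweet_id);

        -- Link every tweet to its group by joining back on the pair.
        INSERT INTO duplicates_tweets (duplicate_id, tweet_id)
        SELECT d.duplicate_id, t.tweet_id
        FROM tweets AS t
        JOIN duplicates AS d
          ON d.user_id = t.user_id AND d.text = t.text;
    """)


def spam_suspects(conn, min_repeats=3):
    """Return (user_id, text, cnt) rows posted at least min_repeats times."""
    return conn.execute(
        "SELECT user_id, text, cnt FROM duplicates "
        "WHERE cnt >= ? ORDER BY cnt DESC, user_id",
        (min_repeats,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT)"
)
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", TWEETS)
build_duplicate_log(conn)
suspects = spam_suspects(conn)
print(suspects)
```

The two INSERT ... SELECT statements should carry over to MySQL largely unchanged. The schema would need adjusting: MySQL spells the keyword AUTO_INCREMENT, and a unique index on a TEXT column needs a prefix length (or a VARCHAR column instead). An index on tweets (user_id, text) may also help the GROUP BY and the join on a table with millions of rows, though that is worth measuring rather than assuming.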
    
    This answer was selected as the best answer by the asker.