dongxiji0687 2012-11-18 16:05
Views: 44
Answer accepted

How can I compare two tables intelligently?

Consider the following two SQL tables:

Table 1                                                                       Table 2

+-------+-------------------------+   +-------+------------------------------+
| USD   | Model                   |   | USD   | Model                        |
+-------+-------------------------+   +-------+------------------------------+
| 700   | iPad 2 WiFi 16GB        |   | 710   | iPad2 WiFi 16GB              |
| 400   | iPhone 4S 16GB          |   | 450   | iPhone4S 16GB                |
| 250   | iPod Touch(4th Gen)8GB  |   | 200   | iPod Touch 4th Generation 8GB|
+-------+-------------------------+   +-------+------------------------------+

I am stuck on comparing the data present in two different tables intelligently. I have dug around a lot on searching and comparison techniques, and I found

  • similar_text()
  • soundex()
  • metaphone()
  • LEVENSHTEIN()
  • like
  • fulltext
  • regexp

...in PHP and MySQL, but none of them is efficient enough. similar_text() and levenshtein() give really good results, but their worst drawback is that they are extremely slow over ~1000 rows. soundex() and metaphone() return the same code for items that are not alike, e.g. "iphone" and "ipad", which are not the same. All I want is to efficiently recognize that two rows like "iPhone 4S 16GB" and "iPhone4S 16GB" (or the other pairs in the example above) refer to the same item, and the comparison must run quickly. Please let me know what my options are for this kind of comparison. I would really appreciate any idea or hint.

Note: one of my tables contains around 900 rows.

This is a continuation of:

Compare two arrays and sort WRT USD

Pattern comparing with mysql between two tables column


4 answers

  • dscuu86620 2012-11-18 17:20

    I covered this kind of thing when doing a spam detector (loads of research, and then ditched the idea later, but moving on...).

    Basically, do not use LIKE: it is slow on large text, and index support is limited. For example:

    LIKE '%hello' cannot use an index, but LIKE 'hello%' can. Also, large fields result in large indexes to make them work as you intend (they are fine for, say, email addresses, which tend to be short).

    Use = instead; with MySQL's default collations it is also case-insensitive, which you must have for this.

    Next, add a new field to the tables which contains the already-parsed metaphone() representation (this means it only has to be calculated ONCE).

    Now you have a table of, say, 1000 records, each with its metaphone version as well as the original. You MUST do this to get the efficiency you require. When you want to see if some text already exists, you just convert the new text to its metaphone version, then search the tables for it (searching on the metaphone-parsed field). Much quicker ;)
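    The precomputed-key approach above can be sketched as follows. This is a minimal Python/SQLite sketch, not the answerer's actual code: Python's standard library has no metaphone(), so a simple normalization key (with an assumed "generation" → "gen" synonym tweak) stands in for the stored metaphone() column; in PHP/MySQL you would store metaphone($model) and index that column instead.

```python
import re
import sqlite3

# Assumed domain-specific synonym map; extend as needed for your data.
SYNONYMS = {"generation": "gen"}

def match_key(model: str) -> str:
    """Stand-in for metaphone(): lowercase, strip punctuation/spaces."""
    words = re.findall(r"[a-z0-9]+", model.lower())
    return "".join(SYNONYMS.get(w, w) for w in words)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (usd INT, model TEXT, mkey TEXT)")
conn.execute("CREATE TABLE t2 (usd INT, model TEXT, mkey TEXT)")
conn.execute("CREATE INDEX idx1 ON t1(mkey)")  # indexed equality lookups
conn.execute("CREATE INDEX idx2 ON t2(mkey)")

t1 = [(700, "iPad 2 WiFi 16GB"), (400, "iPhone 4S 16GB"),
      (250, "iPod Touch(4th Gen)8GB")]
t2 = [(710, "iPad2 WiFi 16GB"), (450, "iPhone4S 16GB"),
      (200, "iPod Touch 4th Generation 8GB")]
for usd, model in t1:
    conn.execute("INSERT INTO t1 VALUES (?,?,?)", (usd, model, match_key(model)))
for usd, model in t2:
    conn.execute("INSERT INTO t2 VALUES (?,?,?)", (usd, model, match_key(model)))

# One indexed equality join on the precomputed key replaces row-by-row
# LIKE / LEVENSHTEIN comparisons.
pairs = conn.execute(
    "SELECT t1.model, t1.usd, t2.usd FROM t1 JOIN t2 ON t1.mkey = t2.mkey"
).fetchall()
```

    With the sample data above, all three rows pair up, e.g. "iPhone 4S 16GB" matches "iPhone4S 16GB" because both normalize to the same key.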

    To improve accuracy, you may want to delete all the common words and remove punctuation such as:

    • and = deleted
    • , = deleted
    • ' = deleted
    • has = deleted
    • it's = its or it is (depending on which you prefer)

    Then collapse any run of whitespace, e.g. 5 spaces, into a single space.

    The nature of what you are doing will have hundreds of little tweaks you can do to perfect it for what you need it for.

    This answer was selected as the best answer by the asker.
