douqu8828 2014-06-06 09:09
Views: 14

PHP and MySQL: fuzzy duplicate detection / text indexing

This is a real-world problem. I need to detect whether a large text file already exists in a database, so that duplicates are not added.

The text is generated by OCR, so two scans of the same document are never identical. For example, "this is the beginning of the file" might come out as "th1s is the begiming ef the file".

I need to be able to check quickly (that is, using a pre-built index) whether a file already exists in the database before adding it.

Currently I have done this by comparing the new item to the existing items like this:

  1. Language must be the same, and
  2. Number of words must be within 10%, and
  3. At least 90% of the 50 most common words must match
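
The three checks above could be sketched roughly as follows. Python is used purely for illustration and this is not the actual PHP code; the names `profile` and `looks_like_duplicate` are hypothetical, and two details are assumptions: "within 10%" is measured against the larger word count, and common words "match" only when they are exactly equal.

```python
import re
from collections import Counter


def profile(text, language, top_n=50):
    """Summarise a document: language, word count, most common words."""
    words = re.findall(r"\w+", text.lower())
    common = {word for word, _ in Counter(words).most_common(top_n)}
    return {"language": language, "word_count": len(words), "common": common}


def looks_like_duplicate(a, b, count_tolerance=0.10, min_overlap=0.90):
    """Apply the three checks to two profiles (assumed interpretation)."""
    # 1. Language must be the same.
    if a["language"] != b["language"]:
        return False
    # 2. Word counts must be within 10% (assumed: relative to the larger one).
    larger = max(a["word_count"], b["word_count"])
    if larger == 0:
        return False
    if abs(a["word_count"] - b["word_count"]) > count_tolerance * larger:
        return False
    # 3. At least 90% of the most common words must match (assumed: exactly).
    smaller_set = min(len(a["common"]), len(b["common"]))
    if smaller_set == 0:
        return False
    overlap = len(a["common"] & b["common"]) / smaller_set
    return overlap >= min_overlap
```

With exact word matching, the OCR example pair above fails check 3: only "is", "the" and "file" survive unchanged, which is half of the six distinct words. That is one way the accuracy problem can show up on short or badly scanned text.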

There are two problems with this approach, both relating to step 3. First, its accuracy is good but not perfect. Second, it is very slow, because the common words of the new item have to be compared against the common words of every existing item that passes steps 1 and 2.

I really need to come up with a more accurate and faster solution, preferably something that does not require millions of comparisons, but instead can use an index.

This appears to be a very difficult problem for a computer, although it is easy for a human. There must be a way to do it: some way to convert the text into a general representation, e.g. a number, such that a similar number would indicate another very similar file (more or less the same words in more or less the same order).
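
One family of techniques that fits this description is locality-sensitive hashing. The sketch below is a minimal SimHash over character trigrams, again in Python for illustration only. The choices of trigrams, MD5 as the per-trigram hash, and a 64-bit fingerprint are all assumptions made for the example, and the function names are hypothetical. Character n-grams are used because an OCR error then disturbs only a few n-grams rather than a whole word. Note that this captures local order only (within each trigram), not the overall order of the words.

```python
import hashlib


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def simhash(text, bits=64, n=3):
    """Return a `bits`-wide fingerprint; similar texts tend to share most bits."""
    weights = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two fingerprints are compared with `hamming_distance(simhash(a), simhash(b))`, and a small distance suggests near-duplicate text; the threshold would have to be tuned on real data. Similar fingerprints differ in few bits, which is not the same as being numerically close, so an ordinary index on the number alone would not find them. A common workaround is to store several bit-blocks of the fingerprint in separately indexed columns and verify candidates that share a block, for example with MySQL's `BIT_COUNT(a ^ b)`.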
