PHP and MySQL: Fuzzy Duplicate Detection / Text Indexing

This is a real-world problem. I need to detect whether a large text file already exists in a database, so that duplicates are not added.

The text is generated by OCR, so two copies are never identical even when they come from the same source. For example, "this is the beginning of the file" may come out as "th1s is the begiming ef the file".

I need to be able to quickly (as in pre-indexed) find whether any file already exists within the database before adding it.

Currently I have done this by comparing the new item to the existing items like this:

1. The language must be the same, and
2. the number of words must be within 10%, and
3. the 50 most common words must match 90% of the time.
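
For illustration, the three rules above could be sketched as follows. This is a minimal sketch rather than the actual implementation: the function names, the dict layout, the reading of "within 10%" as relative to the larger word count, and the reading of "match 90% of the time" as a set overlap are all assumptions.

```python
from collections import Counter


def top_words(text, n=50):
    """Return the set of the n most frequent lowercase words in text."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(n)}


def is_probable_duplicate(new_doc, old_doc, n=50):
    """Apply the three rules to two docs (hypothetical dicts with
    'language' and 'text' keys)."""
    # Rule 1: the language must be the same.
    if new_doc["language"] != old_doc["language"]:
        return False

    # Rule 2: word counts must be within 10% (assumed: of the larger count).
    new_len = len(new_doc["text"].split())
    old_len = len(old_doc["text"].split())
    if abs(new_len - old_len) > 0.10 * max(new_len, old_len, 1):
        return False

    # Rule 3: at least 90% of the most common words must be shared
    # (assumed: overlap measured against the smaller of the two sets).
    new_top = top_words(new_doc["text"], n)
    old_top = top_words(old_doc["text"], n)
    if not new_top or not old_top:
        return False
    overlap = len(new_top & old_top) / min(len(new_top), len(old_top))
    return overlap >= 0.90
```

Rule 3 is the expensive part: it has to run once for every stored item that survives rules 1 and 2, and nothing in it can be answered from an index.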

There are two problems with this approach, both relating to rule 3. First, its accuracy is good but not perfect. Second, it is very slow, because the common words of the new item have to be compared against the common words of every existing item that passes rules 1 and 2.

I really need to come up with a more accurate and faster solution, preferably something that does not require millions of comparisons, but instead can use an index.

This appears to be a very difficult problem for a program, although it is easy for a human. There must be a way to do it: some way to convert the text into a general representation, e.g. a number, such that a similar number would indicate a very similar file (more or less the same words in more or less the same order).
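
One family of techniques that matches this description is locality-sensitive fingerprinting, for example SimHash. The sketch below is only an illustration of that idea, not a tested solution to this problem: the function names, the 4-character shingle size, and the use of MD5 as the underlying hash are all assumptions chosen for the example.

```python
import hashlib

BITS = 64  # fingerprint width; assumed so it fits a 64-bit integer column


def shingles(text, k=4):
    """Return overlapping k-character substrings of the normalised text."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) <= k:
        return [cleaned]
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]


def simhash(text):
    """Return a 64-bit SimHash-style fingerprint of text."""
    weights = [0] * BITS
    for sh in shingles(text):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Return the number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

With a fingerprint like this, "similar" means a small Hamming distance (few differing bits) rather than numerically close values, so an ordinary index on the number does not find neighbours by itself. The distance threshold that separates duplicates from non-duplicates would have to be tuned on real OCR output.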
