This is for a real-world problem. I am trying to detect whether a large text file already exists in a database, so that duplicates are not added.
The text is generated by OCR, so two scans of the same document are never identical. For example, "this is the beginning of the file" and "th1s is the begiming ef the file".
I need to be able to quickly (as in pre-indexed) find whether any file already exists within the database before adding it.
Currently I have done this by comparing the new item to the existing items like this:
- Language must be the same, and
- Number of words must be within 10%, and
- 50 most common words must match 90% of the time
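
The three checks above can be sketched roughly as follows. This is only a minimal illustration, not the actual implementation: the record layout (`language` and `text` keys), the function names, the whitespace tokenisation, and the readings of "within 10%" (relative to the longer text) and "match 90% of the time" (at least 90% of the top-50 words are shared) are all assumptions.

```python
from collections import Counter


def top_words(text, n=50):
    """Return the set of the n most frequent whitespace-separated words."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(n)}


def looks_like_duplicate(new, existing, n=50):
    """Apply the three checks to two records shaped like
    {"language": str, "text": str} (a hypothetical layout)."""
    # Check 1: language must be the same.
    if new["language"] != existing["language"]:
        return False

    # Check 2: word counts within 10% of each other
    # (assumed here to mean 10% of the longer text).
    new_len = len(new["text"].split())
    old_len = len(existing["text"].split())
    if abs(new_len - old_len) > 0.10 * max(new_len, old_len):
        return False

    # Check 3: at least 90% of the most common words must be shared
    # (assumed interpretation; words are compared by exact equality).
    new_top = top_words(new["text"], n)
    old_top = top_words(existing["text"], n)
    if not new_top or not old_top:
        return False
    shared = len(new_top & old_top)
    return shared >= 0.90 * min(len(new_top), len(old_top))


def find_duplicates(new, database):
    """Linear scan: the new record is compared with every existing one."""
    return [doc for doc in database if looks_like_duplicate(new, doc)]
```

In this sketch `find_duplicates` has to visit every stored record for each new file, and the word comparison in check 3 is by exact equality, so an OCR error in a common word counts as a mismatch.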
There are two problems with this approach, both relating to the third check (the common-word comparison). Firstly, its accuracy is good but not perfect; secondly, it is very slow, because the common words of the new file have to be compared against the common words of every existing file that passes the first two checks.
I really need to come up with a more accurate and faster solution, preferably something that does not require millions of comparisons, but instead can use an index.
This appears to be a very difficult problem, although it is easy for a human. There must be a way to do it: some way to convert the text into a general representation, e.g. a number, such that a similar number would indicate another very similar file (more or less the same words in more or less the same order).