I'm trying to develop a way of taking an entity with a number of properties and searching the database for similar entities, matching as many of the properties in the correct order as possible. The idea is that it would then return a percentage showing how similar each entity is.
The order of the properties should also be taken into account, so the properties at the beginning are more important than the ones at the end.
For example:
Item 1 - A, B, C, D, E
Item 2 - A, B, C, D, E
Would be a 100% match
Item 1 - A, B, C, D, E
Item 2 - B, C, A, D, E
This wouldn't be a perfect match as the properties are in a different order
Item 1 - A, B, C, D, E
Item 2 - F, G, H, I, A
Would be a low match as only one property is the same and it is in position 5
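To make the three examples above concrete, here is one way such a score could be computed. It is only an illustrative sketch, written in Python for brevity (it should port to PHP almost line for line). The function name `similarity`, the linear weights (first position weighs most) and the 50% partial credit for a property found in the wrong position are all assumptions, not requirements. It also assumes a property appears at most once per item.

```python
def similarity(a, b, misplaced_credit=0.5):
    """Return a 0-100 score; earlier positions weigh more (assumed linear weights)."""
    n = max(len(a), len(b))
    if n == 0:
        return 100.0
    # Assumption: position 0 gets weight n, position 1 gets n - 1, ... last gets 1.
    weights = [n - i for i in range(n)]
    total = sum(weights)

    # Remember the first position of each property in b.
    pos_in_b = {}
    for j, prop in enumerate(b):
        pos_in_b.setdefault(prop, j)

    score = 0.0
    for i, prop in enumerate(a):
        if i < len(b) and b[i] == prop:
            # Same property in the same position: full weight.
            score += weights[i]
        elif prop in pos_in_b:
            # Same property, different position: partial credit (assumed),
            # capped by the less important of the two positions.
            j = pos_in_b[prop]
            score += misplaced_credit * min(weights[i], weights[j])
    return 100.0 * score / total


item = ["A", "B", "C", "D", "E"]
print(similarity(item, ["A", "B", "C", "D", "E"]))  # identical order
print(similarity(item, ["B", "C", "A", "D", "E"]))  # same properties, reordered
print(similarity(item, ["F", "G", "H", "I", "A"]))  # one shared property, last position
```

With these assumed weights the three examples score 100, roughly 53 and roughly 3 respectively, which matches the intended ranking. The exact numbers depend entirely on the weights chosen.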
This algorithm will run against many thousands of records, so it needs to perform well. Any thoughts as to how I could do this efficiently in PHP/MySQL?
I was considering Levenshtein distance, but as far as I can tell it would also measure the spelling difference between two completely different words. That doesn't appear to be ideal for this scenario, unless I'm just using it in the wrong way.
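For what it's worth, edit distance does not have to work on characters. The same dynamic-programming recurrence can be applied to a list of properties, where each property is compared as a whole and spelling never enters into it. The sketch below is a standard Levenshtein implementation over lists, again in Python as an illustration; the names `token_levenshtein` and `edit_similarity` are hypothetical. Note that it treats every position equally, so by itself it does not make early properties more important than late ones.

```python
def token_levenshtein(a, b):
    """Levenshtein distance where the unit of comparison is a whole property."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute (or keep if equal)
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Turn the distance into a 0-100 percentage (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - token_levenshtein(a, b) / longest)


item = ["A", "B", "C", "D", "E"]
print(token_levenshtein(item, ["B", "C", "A", "D", "E"]))  # 2: drop A, re-insert it after C
print(edit_similarity(item, ["F", "G", "H", "I", "A"]))    # 0.0: every position must change
```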
It might be possible to do this solely in MySQL, perhaps using a full-text search or something similar.
This seems like a nice solution, though not designed for this scenario. Perhaps binary comparison could be used in some way?