I'm trying to find the best way to compare a string against a list of similar strings and pick the closest match.
I have an array of plain movie names. I also have an array of movie names with additional text appended.
Example:
My plain movie name array contains strings like so:
"Super Troopers",
"Everest",
"Star Wars: Episode I The Phantom Menace"
The strings in my other array look similar to the following:
"Super Troopers (2001) 720P-AC3-x264",
"Everest - 2015.1080p.DTS mkv",
"Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD"
What I'm currently doing is looping through my first array and comparing each movie name against the second array using strpos(). If I find an exact match, great. If not, I need some other function to work out which two strings are most similar. I have tried similar_text() and levenshtein(), with mixed results.
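For reference, here is a simplified sketch of that first strpos() pass. The array names and the findExactMatch() helper are made up for illustration; the data is just the examples above.

```php
<?php
// Illustrative data copied from the examples above; variable names are assumptions.
$titles = [
    "Super Troopers",
    "Everest",
    "Star Wars: Episode I The Phantom Menace",
];
$releases = [
    "Super Troopers (2001) 720P-AC3-x264",
    "Everest - 2015.1080p.DTS mkv",
    "Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD",
];

// Hypothetical helper: return the first release name that contains the title, or null.
function findExactMatch(string $title, array $releases): ?string
{
    foreach ($releases as $release) {
        // strpos() returns 0 for a match at the very start, so compare with !== false.
        if (strpos($release, $title) !== false) {
            return $release;
        }
    }
    return null;
}

$matches = [];
foreach ($titles as $title) {
    // null marks a title that still needs a fuzzy comparison.
    $matches[$title] = findExactMatch($title, $releases);
}
```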
In my examples above, strpos() would have matched both Everest and Super Troopers just fine, but the Star Wars string needs additional checks. Differences such as hyphens versus colons and "I" versus "1", along with the additional information that follows the movie name, seem to give me inconsistent results with similar_text() and levenshtein().
I'm thinking of first trimming the strings that carry the additional information: take the strlen() of the movie name, add 5 or so characters for good measure, cut the longer string to that length, and only then run similar_text() or levenshtein(). The one thing all of these strings have in common is that the movie name is at the start. Would this make the string similarity functions a bit more accurate?
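To make the trimming idea concrete, here is a rough sketch using levenshtein(). The helper names prefixDistance() and closestByPrefix() are hypothetical, and the padding of 5 is only my guess from above, not a tuned value.

```php
<?php
// Hypothetical helper: edit distance between the title and only the leading
// part of the release name (title length plus a few characters of padding).
function prefixDistance(string $title, string $release, int $padding = 5): int
{
    $prefix = substr($release, 0, strlen($title) + $padding);
    return levenshtein($title, $prefix);
}

// Hypothetical helper: return the release name whose prefix has the smallest distance.
function closestByPrefix(string $title, array $releases): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($releases as $release) {
        $distance = prefixDistance($title, $release);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $release;
        }
    }
    return $best; // null only when $releases is empty
}
```

Note that this always returns some candidate, even a poor one, so it would probably need a maximum acceptable distance before I could trust the result.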
Or maybe some function that breaks up each word and checks to see how many match in the other string. Does such a function exist?
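I'm not aware of a built-in that does exactly this, so below is a minimal sketch of what I mean, built from preg_split() and array_intersect(). The tokenize() and wordOverlap() names are made up, and lowercasing plus splitting on anything that isn't a letter or digit is an assumption on my part.

```php
<?php
// Hypothetical helper: lowercase a string and split it into unique alphanumeric words.
function tokenize(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Hypothetical helper: fraction of the title's words that also appear in the release name.
function wordOverlap(string $title, string $release): float
{
    $titleWords = tokenize($title);
    if ($titleWords === []) {
        return 0.0;
    }
    $shared = array_intersect($titleWords, tokenize($release));
    return count($shared) / count($titleWords);
}
```

With the Star Wars example, every word of the title except "I" appears in the longer string, so the score stays high despite the punctuation differences. It still would not treat "I" and "1" as the same word.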
I'll mess around with it more, but if anyone has any input on how they might tackle the problem, I'd love to know.
Thanks.