I have a website where I index information about movies. To find a movie the user can input the title of the movie and this query will be sent to the database:
SELECT IMDB, Name, Year, Views
FROM Movies
WHERE Name LIKE '%$search%'
I used the "similar_text" function to solve little mistakes. For example if the name is "Pulp Fiction" and the user types: "Pulp Foction", since I don't get any results I run this little code for every movie.
similar_text($search, $Name, $percent);
if ($percent > $ValMax) {
$ValMax = $percent;
$PosMax = $i;
}
I'm using this code since I only have a couple hundred movies in the database and in the future it's going to be at max about twenty thousand. Also every movie as a title of an average of 20 characters, without considering movies like Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
The real problem starts with a query used by an user: "Capitan America" (italian for Captain America) where the name of the movie was "Captain America – Il primo vendicatore"
Using the Levenshtein function I got horrible results where with the similar_text they were at least decent. You can see the full results of that query here. I made up this page to see for each search what the levenshtein and the similar_text would be.
The movie that would have been selected if I didn't place a 60% blockage would have been "C'era una volta in America" (Once upon a time in America).
However "Capitan America" and "Captain America" are pretty similar so I was wondering if there was a way to maybe check for each word. I've also heard about Lucene but I didn't know how to start from and If it was suited for my needs.
Thanks :)