I'm looking for a search engine, or a search script, that can:
- Search lots of large text files, specifically hundreds of full text novels.
- Use regex to return words and possible variations.
- Give the location in the file of every match, such as the line number or word count.
- Ideally work with JavaScript or PHP, as they're the only languages I'm adept in, and I'll probably have to manipulate the results. But I'm sure I can bite the bullet and learn the syntax of whatever language is needed.
- Filter a search result array of words against a dictionary to find proper nouns (this part may not involve the search engine itself).
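To make the second and third points concrete, here is a rough sketch in JavaScript of the kind of output I'm after. The function name `findMatches` and the sample regex are my own placeholders, not part of any existing tool, and the pattern would certainly need tuning.

```javascript
// Hypothetical helper (my own name, not from a library).
// Returns the line number and running word count of every word matching `pattern`.
// Assumption: `pattern` has no "g" flag, so test() keeps no state between calls.
function findMatches(text, pattern) {
  const results = [];
  const lines = text.split(/\r?\n/);
  let wordsBefore = 0; // number of words on all earlier lines
  lines.forEach((line, i) => {
    const words = line.match(/\S+/g) || [];
    words.forEach((word, j) => {
      if (pattern.test(word)) {
        results.push({ line: i + 1, wordIndex: wordsBefore + j + 1, word });
      }
    });
    wordsBefore += words.length;
  });
  return results;
}

// Made-up sample text, just for illustration.
const sample = "She married her cousin.\nThe marriage was happy.";
console.log(findMatches(sample, /\bmarr(y|ies|ied|ying|iage|iages)\b/i));
```

Each result carries both a line number and a word position, so either could be used later to measure distance between matches.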
The context and specifics (kind of long and only somewhat important):
I have a friend doing a doctoral thesis on the theme of cousin marriages in 19th century novels. Sifting through them by hand would take forever, and although no algorithm would be perfect, it should narrow things down greatly. I'm searching for the word "marriage" and every variation of it, the word "cousin" and every variation of it, and checking their relative proximity. Of course, I'm searching hundreds of full text novels.
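To show what I mean by relative proximity, here is a minimal sketch in JavaScript. The name `findNearby`, the two patterns, and the distance limit are all assumptions of mine; I'm measuring distance in words, but lines or characters might work just as well.

```javascript
// Hypothetical helper: report every pair of matches that sit within
// `maxDistance` words of each other. Patterns must not use the "g" flag.
function findNearby(text, patternA, patternB, maxDistance) {
  const words = text.split(/\s+/).filter(Boolean);
  const positionsA = [];
  const positionsB = [];
  words.forEach((word, i) => {
    if (patternA.test(word)) positionsA.push(i);
    if (patternB.test(word)) positionsB.push(i);
  });
  const hits = [];
  // Simple nested loop: fine for a sketch, but slow if both words are very common.
  for (const a of positionsA) {
    for (const b of positionsB) {
      const distance = Math.abs(a - b);
      if (distance <= maxDistance) {
        hits.push({ wordA: words[a], wordB: words[b], indexA: a, indexB: b, distance });
      }
    }
  }
  return hits;
}

// Made-up sentence, just for illustration.
const line = "He hoped to marry his cousin before the winter.";
const marriagePattern = /\bmarr(y|ies|ied|ying|iage|iages)\b/i; // assumed variations
const cousinPattern = /\bcousins?\b/i;                          // assumed variations
console.log(findNearby(line, marriagePattern, cousinPattern, 10));
```

Something like this per novel would give me a list of passages to hand over for reading, rather than a yes or no answer.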
Finding their relative proximity is the feature I'm having a hard time finding. Beyond that, I may need to search for all names to make sure a main character, if not the protagonist, is involved. That means I'm trying to determine:
A. Names in general.
B. The protagonist, who should be among the most frequently used names.
As for names in general, I don't think there's a comprehensive database of 19th century names, so I'm left filtering for proper nouns. From there, I have the conundrum of generic words that are capitalized only because they follow punctuation, mixed in with the proper nouns. I think my best bet is to filter all those capitalized words through a comprehensive dictionary, leaving the proper nouns. Names will probably be the most frequently used, but I'll see if I can filter out any other proper nouns, such as places. Granted, it's far from perfect, but it'll narrow things down significantly.
This means comparing two huge lists of words. There are tons of ways to do this, but if it's in a format that's easy to work with in a language I know, that would be ideal. My best guess is to compare the array of capitalized words with an array of dictionary words and find the differences. If it's in PHP or JavaScript, I'm good. As for any other language, if it's a relatively simple operation I'm sure I can figure out the syntax well enough.
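Here is roughly what I picture for the comparison, as a sketch in JavaScript. The function name `likelyNames` is made up, and the tiny dictionary stands in for a real word list that I would have to load from somewhere. I'm assuming a Set is a better fit than a plain array, since each lookup stays fast even with a huge dictionary.

```javascript
// Hypothetical helper: collect capitalized words that are NOT in the dictionary
// and count them, most frequent first. `dictionary` is a Set of lowercase words.
function likelyNames(text, dictionary) {
  const counts = new Map();
  // Simple ASCII-only pattern: a capital letter followed by lowercase letters.
  const capitalized = text.match(/\b[A-Z][a-z]+\b/g) || [];
  for (const word of capitalized) {
    // Sentence-initial generic words ("The", "When") drop out here.
    if (!dictionary.has(word.toLowerCase())) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((x, y) => y[1] - x[1]);
}

// Placeholder dictionary and made-up passage, just for illustration.
const dictionary = new Set(["the", "was", "her", "cousin", "when", "left", "wept", "married"]);
const passage = "Fanny was the cousin. When Edmund left, Fanny wept. The cousin married Fanny.";
console.log(likelyNames(passage, dictionary)); // [["Fanny", 3], ["Edmund", 1]]
```

One weakness I can already see: a name that is also an ordinary word, such as Grace or Rose, would be thrown out along with the generic words.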
Maybe that was a bit too much context, but any advice on the whole algorithm and process is also appreciated.
Thank you very much for your time and help! You'll be contributing to one huge doctoral thesis by saving countless hours of time, so my friend will also be very grateful.
Cheers!