douba2705 2012-01-16 06:33

Search engine script - regex, multiple files, line numbers

I'm looking for a search engine script, or search engine that can:

  1. Search lots of large text files, specifically hundreds of full-text novels.
  2. Use regex to return words and their possible variations.
  3. Give the location in the file of every match, such as line number or word count.
  4. Ideally be written in JavaScript or PHP, as they're the only languages I'm adept in, and I'll probably have to manipulate the results. But I'm sure I can bite the bullet and learn the syntax of whatever language is needed.
  5. Filter a search-result array of words against a dictionary to find proper nouns. (This may not be part of the search engine itself.)
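
To make requirements 1 to 3 concrete, here is a minimal sketch in JavaScript for Node.js. The function names `findMatches` and `searchFiles` are made up for illustration, and the sketch assumes each novel is a plain UTF-8 text file small enough to read into memory. It reports a line number and a running word position for every word that matches the regex.

```javascript
const fs = require("fs");

// Return every word in `text` that matches `pattern`, with its 1-based
// line number and 1-based word position counted from the start of the text.
// Words are whitespace-separated, so attached punctuation stays on the word.
function findMatches(text, pattern) {
  // Drop the g/y flags so repeated .test() calls do not carry state.
  const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  const results = [];
  let wordsBefore = 0;
  text.split(/\r?\n/).forEach((line, i) => {
    const words = line.split(/\s+/).filter(Boolean);
    words.forEach((word, j) => {
      if (re.test(word)) {
        results.push({ line: i + 1, wordIndex: wordsBefore + j + 1, word });
      }
    });
    wordsBefore += words.length;
  });
  return results;
}

// Hypothetical helper: run the same search over a list of file paths and
// return an object mapping each path to its matches.
function searchFiles(paths, pattern) {
  const hits = {};
  for (const path of paths) {
    hits[path] = findMatches(fs.readFileSync(path, "utf8"), pattern);
  }
  return hits;
}
```

A pattern such as `/\bmarr(?:y|ied|ies|ying|iages?)\b/i` would cover the common variations of "marriage"; the exact list of variations is a judgment call.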

The context and specifics (kind of long and only somewhat important):

I have a friend doing a doctoral thesis on the theme of cousin marriages in 19th century novels (think Shakespeare plays). Sifting through them by hand would take forever, and although no algorithm would be perfect, it should narrow things down greatly. I'm searching for the word "marriage" and every variation of it, the word "cousin" and every variation of it, and checking their relative proximity. Of course, I'm searching hundreds of full-text novels.
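
One way to picture the proximity check is sketched below in JavaScript. It is only an illustration under assumptions: the name `findNearbyPairs`, the word-level tokenizer, and the idea of measuring distance in words (rather than lines or characters) are not taken from any existing tool.

```javascript
// Split text into lowercase word tokens (letters and apostrophes only).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Find every pair of words where one matches patternA, the other matches
// patternB, and they are at most maxDistance words apart.
// Both patterns are tested against single lowercase words and should not
// use the g flag.
function findNearbyPairs(text, patternA, patternB, maxDistance) {
  const words = tokenize(text);
  const positionsA = [];
  const positionsB = [];
  words.forEach((word, i) => {
    if (patternA.test(word)) positionsA.push(i);
    if (patternB.test(word)) positionsB.push(i);
  });
  const pairs = [];
  // Simple nested loop; fine for a sketch, slow if both words are very common.
  for (const a of positionsA) {
    for (const b of positionsB) {
      const distance = Math.abs(a - b);
      if (distance <= maxDistance) {
        pairs.push({ indexA: a, indexB: b, distance });
      }
    }
  }
  return pairs;
}
```

For example, `findNearbyPairs(text, /^cousins?$/, /^marr(?:y|ied|ies|ying|iages?)$/, 50)` would list every "cousin" that appears within 50 words of a "marriage" variation. The threshold of 50 is arbitrary and would need tuning against a few novels known to contain the theme.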

Finding their relative proximity is the feature I'm having a hard time finding. Beyond that, I may need to search for all names to ensure that a main character, if not the protagonist, is involved. That means I'm trying to determine:
A. Names in general.
B. The protagonist. - should be among the most frequently used names.
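
As a rough sketch of points A and B, the JavaScript below counts capitalized words that do not start a sentence and ranks them by frequency, on the assumption that the protagonist's name lands near the top. The function name `rankNameCandidates` and the short `TITLES` list are illustrative assumptions, and the sentence detection is deliberately naive.

```javascript
// Honorifics whose trailing period should not be treated as a sentence end.
// This list is an assumption and is far from complete.
const TITLES = new Set(["Mr", "Mrs", "Miss", "Dr", "St"]);

// Count capitalized words that appear mid-sentence and return them as
// [word, count] pairs sorted from most to least frequent.
function rankNameCandidates(text) {
  const counts = new Map();
  // Tokens are either words or sentence-ending punctuation marks.
  const tokens = text.match(/[A-Za-z']+|[.!?]/g) || [];
  let sentenceStart = true;
  let previous = "";
  for (const token of tokens) {
    if (/^[.!?]$/.test(token)) {
      // "Mr." and similar do not end a sentence.
      if (!(token === "." && TITLES.has(previous))) sentenceStart = true;
    } else {
      const capitalized = /^[A-Z][a-z]+$/.test(token);
      if (!sentenceStart && capitalized && !TITLES.has(token)) {
        counts.set(token, (counts.get(token) || 0) + 1);
      }
      sentenceStart = false;
    }
    previous = token;
  }
  return [...counts.entries()].sort((x, y) => y[1] - x[1]);
}
```

Names that only ever appear at the start of a sentence are missed, and place names are counted alongside people, so the ranking is a starting point for manual review rather than an answer.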

As for names in general, I don't think there's a comprehensive database of 19th century names, so I'm left filtering out proper nouns. From there, I have the conundrum of generic words, as well as proper nouns, following punctuation. I think my best bet is to filter all those words through a comprehensive dictionary, leaving proper nouns. Names will probably be the most frequently used, but I'll see if I can filter out any other proper nouns, such as places. Granted, it's far from perfect, but it'll narrow things down significantly.

This means comparing two huge lists of words. There are tons of ways to do this, but a format that's easy to work with in a language I know would be ideal. My best guess is to compare the array of capitalized words with an array of dictionary words and find the differences. If it's in PHP or JavaScript, I'm good. As for any other language, if it's a relatively simple operation I'm sure I can figure out the syntax well enough.
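
The array comparison described above might look like the following JavaScript sketch. The name `filterProperNouns` is hypothetical, and the dictionary is assumed to be available as a plain array of words (for instance, loaded from a one-word-per-line list). Putting the dictionary in a `Set` keeps each lookup fast even when both lists are huge.

```javascript
// Return the unique capitalized words that do not appear in the dictionary.
// The comparison is case-insensitive, so "House" is dropped if the
// dictionary contains "house".
function filterProperNouns(capitalizedWords, dictionaryWords) {
  const dictionary = new Set(dictionaryWords.map((w) => w.toLowerCase()));
  const unique = new Set(capitalizedWords);
  return [...unique].filter((w) => !dictionary.has(w.toLowerCase()));
}
```

One known weakness: names that are also ordinary words, such as "Rose" or "Grace", are filtered out along with the generic words, so they would need a separate check.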

Maybe that was a bit too much context, but any advice on the whole algorithm and process is also appreciated.

Thank you very much for your time and help! You'll be contributing to one huge doctoral thesis by saving countless hours of time, so my friend will also be very grateful.

Cheers!

1 answer

  • douyi3632 2012-01-16 07:35

    Sphider is an open source search engine that you can download. It has most of the features you need: http://www.sphider.eu/demo.php
