dth62818 2012-10-15 03:08
Accepted

Regular expression to extract sentences from a text file

I need a regular expression that will extract sentences from a text file. Example text:

Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.

Here's my code:

$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

But the last sentence is still split into "information by mr." and "Kahana." How can I solve this? Thank you :)


  • duanpo1498 2012-10-15 03:31

    You Can't Do This with Regular Expressions

    English does not follow strict, well-defined formatting rules, so regular expressions are a poor fit for what you are trying to do. What you are really looking for is something like a natural language processing (NLP) tool.

    Unless this is critical to your program, I suggest you instead determine the following things:

    • What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you or your client?
    • Where is the text coming from? For example, a textbook will most likely be written differently than people's Twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
    • What am I doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) if you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.
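
    To put a number on the first question, one rough option is to hand-split a small sample of your real text and check how many of those sentences the splitter reproduces exactly. The sketch below is only an illustration: the function name `sentenceAccuracy`, the sample text, and the exact-match metric are all assumptions, not something from the question.

```php
<?php
// Hypothetical helper: fraction of hand-labelled sentences that the
// splitter reproduced exactly (a crude but easy-to-read metric).
function sentenceAccuracy(array $expected, array $actual): float
{
    if (count($expected) === 0) {
        return 1.0; // nothing to get wrong
    }
    // array_intersect keeps the expected sentences that also appear in $actual.
    $hits = count(array_intersect($expected, $actual));
    return $hits / count($expected);
}

// Tiny made-up sample; in practice use text from your real source.
$text     = 'It rained. information by mr. Kahana.';
$expected = ['It rained.', 'information by mr. Kahana.'];

// The split rule from the question.
$actual = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

$accuracy = sentenceAccuracy($expected, $actual);
```

    On this sample the regex breaks the second sentence at "mr.", so only one of the two expected sentences is recovered. A larger labelled sample gives a more meaningful figure to compare against your target.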

    My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
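
    As a minimal sketch of the "keep adding exceptions" approach, assuming the split rule from the question is kept: split first, then glue a piece back onto the next one whenever it ends in a known abbreviation. The function name `splitSentences` and the abbreviation list are hypothetical; the list is meant to grow as you find errors in your own text.

```php
<?php
// Hypothetical exception list; extend it from mistakes seen in your corpus.
$abbreviations = ['mr', 'mrs', 'ms', 'dr', 'prof'];

function splitSentences(string $text, array $abbreviations): array
{
    // Same rule as in the question: split on whitespace that follows
    // . ! or ? (optionally followed by a closing quote).
    $pieces = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($pieces as $piece) {
        // Note: merged pieces are rejoined with a single space.
        $buffer = ($buffer === '') ? $piece : $buffer . ' ' . $piece;

        // If the text so far ends in "<word>." and <word> is a known
        // abbreviation, the period did not end the sentence: keep going.
        if (preg_match('/(\w+)\.$/', $buffer, $m)
            && in_array(strtolower($m[1]), $abbreviations, true)) {
            continue;
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    if ($buffer !== '') {
        $sentences[] = $buffer; // text that ended on an abbreviation
    }
    return $sentences;
}

$text = 'Consider, for example, the Asian tsunami disaster that happened in the end of 2004. '
      . 'A query to Google News (http://news.google.com) returned more than 80,000 online news '
      . 'articles about this event within one month (Jan.17 through Feb.17, 2005). '
      . 'information by mr. Kahana.';

$sentences = splitSentences($text, $abbreviations);
```

    With the example text from the question, "information by mr. Kahana." stays in one piece. The obvious cost is the opposite error: a sentence that really does end in "Dr." or "Mr." is now merged with the next one, which is exactly the kind of trade-off the error-rate question is about.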

    In short, PHP and Regular Expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.

    This answer was selected by the asker as the best answer.