dth62818 2012-10-14 19:08
Accepted

Regular expression to extract sentences from a text file

I need a regular expression that will extract sentences from a text file. Example text:

Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.

Here's my code:

$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

But the last sentence is still split into "information by mr." and "Kahana." How can I solve this? Thank you :)


1 answer

  • duanpo1498 2012-10-14 19:31

    You Can't Do This with Regular Expressions

    English as a language does not follow tidy formatting rules. As such, regular expressions are not well suited to what you are trying to do. What you are really looking for is something like a natural language processing (NLP) tool.

    Unless this is critical to your program, I suggest you instead determine the following things:

    • What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you or your client?
    • Where is the text coming from? For example, a textbook will most likely be written differently than people's Twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
    • What am I doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) if you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.

    My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
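
    As a rough illustration of the "keep adding exceptions" approach, here is a minimal sketch that reuses the split pattern from the question and adds one negative lookbehind per known abbreviation. The function name splitSentences and the abbreviation list are hypothetical, made up for this example; the list you actually need depends on your own text.

```php
<?php
// Hypothetical exception list; grow it based on the errors you see in your own corpus.
$abbreviations = ['mr', 'mrs', 'ms', 'dr', 'prof'];

/**
 * Split $text into sentences, but never right after a listed abbreviation.
 * Illustrative helper only, not part of PHP or any library.
 */
function splitSentences(string $text, array $abbreviations): array
{
    // Build one fixed-length negative lookbehind per abbreviation, e.g. (?<!\bmr\.)
    $exceptions = '';
    foreach ($abbreviations as $abbr) {
        $exceptions .= '(?<!\b' . preg_quote($abbr, '/') . '\.)';
    }

    // Same split point as the pattern in the question, minus the exceptions.
    // The "i" flag lets "mr." and "Mr." both count as abbreviations.
    $re = '/(?<=[.!?]|[.!?][\'"])' . $exceptions . '\s+/i';

    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$text = 'It happened in the end of 2004. The count was high (Feb.17, 2005). information by mr. Kahana.';
print_r(splitSentences($text, $abbreviations));
```

    With this list, "information by mr. Kahana." stays in one piece. The trade-off is that a sentence that really does end in "Dr." or "Mr." will now be glued to the next one, which is exactly the kind of error you have to measure and decide whether you can live with.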

    In short, PHP and Regular Expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.

    This answer was selected by the asker as the best answer.