donglian7879 2019-03-07 17:27

Regex to split on period|semicolon, but ignore URLs

I'm trying to parse and match a lot of legal text, splitting it into individual sentences. The following regex works fine for a few lines of simple text:

[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)

! and ? are pretty irrelevant here, but . and ; are very common separators in the texts I'm working with. The problem is that the regex above only finds those delimiters when they are followed by whitespace. The following text, for example, would not be matched properly:

Member State law or pursuant to contract with a health professional and subject to the conditions and safeguards referred to in paragraph 3; processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.eu/ploteus/en/compare Adopted 7 comparable procedures (e. g. certifications/audits), and registered as required by the Member State. of quality and safety of health care and of medicinal products or medical devices, on the basis of Union or Member State law, which provides for suitable and specific measures to safeguard the rights and freedoms of the data subject, in particular professional secrecy; processing is...

In particular, the following entire section:

processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.

would not be matched at all.
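A minimal Go reproduction of the behaviour described above, assuming the standard `regexp` package and the pattern written on one line with `\n` inside the character classes. The helper name `matchSentences` and the shortened sample text are made up for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentenceRe is the pattern from the question, written on a single line.
var sentenceRe = regexp.MustCompile(`[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)`)

// matchSentences (hypothetical helper) returns every chunk the pattern
// matches, with surrounding whitespace trimmed.
func matchSentences(text string) []string {
	var out []string
	for _, m := range sentenceRe.FindAllString(text, -1) {
		out = append(out, strings.TrimSpace(m))
	}
	return out
}

func main() {
	// Shortened stand-in for the legal text in the question.
	text := "referred to in paragraph 3; comparison tool at " +
		"https://ec.europa.eu/ploteus/en/compare Adopted 7 (e. g. audits), and registered. "
	for _, s := range matchSentences(text) {
		fmt.Printf("%q\n", s)
	}
}
```

The chunk "comparison tool at https://ec.europa." never shows up in the output: each dot inside the URL is a delimiter that is not followed by whitespace, so no match can start anywhere in that stretch and the engine only picks up again at "eu/ploteus/...".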

Any help in improving the above regex would be greatly appreciated!

Thanks

1 answer

  • dongshan4549 2019-03-08 17:55

    I think what you're looking for is called a sentence tokenizer. For Go, I can recommend one library, github.com/jdkato/prose; it should work like a charm.

    Personally, I have never used it. Good luck!
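If pulling in a tokenizer library is not an option, a rough standard-library fallback could look like the sketch below. It is not the prose library's method, just one possible heuristic: Go's `regexp` has no lookbehind, so instead of matching whole sentences it locates delimiter-plus-whitespace boundaries and cuts there. The names `splitSentences` and `singleLetterBefore` are hypothetical, and the single-letter abbreviation rule (for cases like "e. g.") is an assumption that will misfire on sentences that really end in one letter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// boundaryRe finds a delimiter followed by whitespace. Dots inside a URL
// are not followed by whitespace, so they are never treated as boundaries.
var boundaryRe = regexp.MustCompile(`[.;!?]\s+`)

func isASCIILetter(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// singleLetterBefore reports whether the byte at index i is preceded by
// exactly one ASCII letter, as in "e." or "g." (ASCII-only heuristic).
func singleLetterBefore(text string, i int) bool {
	if i == 0 || !isASCIILetter(text[i-1]) {
		return false
	}
	return i == 1 || !isASCIILetter(text[i-2])
}

// splitSentences cuts text after each delimiter that is followed by
// whitespace, skipping dots that look like one-letter abbreviations.
func splitSentences(text string) []string {
	var out []string
	start := 0
	for _, loc := range boundaryRe.FindAllStringIndex(text, -1) {
		if text[loc[0]] == '.' && singleLetterBefore(text, loc[0]) {
			continue // treat "e." / "g." as part of the sentence
		}
		// Keep the delimiter with the sentence it terminates.
		if s := strings.TrimSpace(text[start : loc[0]+1]); s != "" {
			out = append(out, s)
		}
		start = loc[1]
	}
	// Whatever remains after the last boundary is the final sentence.
	if s := strings.TrimSpace(text[start:]); s != "" {
		out = append(out, s)
	}
	return out
}

func main() {
	text := "paragraph 3; see https://ec.europa.eu/x (e. g. audits), and done. Next one"
	for _, s := range splitSentences(text) {
		fmt.Printf("%q\n", s)
	}
}
```

For real legal text with many abbreviations, a dedicated sentence tokenizer is still likely to be more robust than this kind of hand-rolled rule.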

    This answer was accepted by the asker as the best answer.
