donglian7879 2019-03-07 17:27

Regex to split on period|semicolon, but ignore URLs

I'm trying to parse and match a lot of legal text, splitting it into individual sentences. I have the following regex, which works fine for a few lines of simple text:

[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)

! and ? are pretty irrelevant here, but . and ; are quite common as separators in the texts I'm working with. The problem is that the regex above only finds those delimiters when they are followed by a whitespace character. The following text, for example, is not matched properly:

Member State law or pursuant to contract with a health professional and subject to the conditions and safeguards referred to in paragraph 3; processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.eu/ploteus/en/compare Adopted 7 comparable procedures (e. g. certifications/audits), and registered as required by the Member State. of quality and safety of health care and of medicinal products or medical devices, on the basis of Union or Member State law, which provides for suitable and specific measures to safeguard the rights and freedoms of the data subject, in particular professional secrecy; processing is...

Specifically, this entire section:

processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.

would not be matched at all.
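
For reference, here is a minimal reproduction. It is only a sketch and rests on two assumptions: that the code runs on Go's standard `regexp` package, and that the line breaks inside the pattern above stand for `\n`. The helper name `findSentences` and the shortened sample text are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sentenceRe is the pattern from the question written on one line.
// Assumption: the line breaks inside the character classes were "\n".
var sentenceRe = regexp.MustCompile(`[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)`)

// findSentences (hypothetical helper) returns every match of the pattern.
func findSentences(text string) []string {
	return sentenceRe.FindAllString(text, -1)
}

func main() {
	// Shortened excerpt of the problem text, with a trailing space so the
	// last delimiter is also followed by whitespace.
	text := "referred to in paragraph 3; ensuring high standards comparison tool at " +
		"https://ec.europa.eu/ploteus/en/compare Adopted 7 comparable procedures " +
		"(e. g. certifications/audits), and registered as required by the Member State. "

	for i, m := range findSentences(text) {
		fmt.Printf("%d: %q\n", i, m)
	}
}
```

Running this, no match contains the stretch from "ensuring" up to `https://ec.europa.`: the dots inside the URL are not followed by whitespace, so matching only resumes at `eu/ploteus/...`.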

Any help in improving the above regex would be greatly appreciated!

Thanks

1 answer

  • dongshan4549 2019-03-08 17:55

    I think what you are looking for is called a sentence tokenizer. For Go, one library I can recommend is github.com/jdkato/prose; it should do the job nicely.

    I have never used it personally, though. Good luck!
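
If pulling in a library is not an option, the rough sketch below shows one way to stay with the standard `regexp` package. Go's engine has no lookarounds, so the idea is to let a single pattern match either a URL or a delimiter followed by whitespace, and to cut only on the delimiter matches. The helper name `splitSentences`, the URL pattern, and the sample text are assumptions for illustration, not part of any library.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRe matches either a URL (group 1) or a delimiter followed by
// whitespace (group 2). Because the URL alternative starts earlier in the
// text, dots inside a URL are consumed before they can count as delimiters.
// The URL must end in a non-delimiter character, so a period that closes a
// sentence right after a URL is still seen as a delimiter.
var tokenRe = regexp.MustCompile(`(https?://\S*[^\s.;!?])|([.;!?])\s+`)

// splitSentences (hypothetical helper) cuts text after each delimiter that
// is followed by whitespace, skipping anything inside a URL.
func splitSentences(text string) []string {
	var out []string
	start := 0
	for _, loc := range tokenRe.FindAllStringSubmatchIndex(text, -1) {
		if loc[4] < 0 {
			continue // group 2 did not participate: this match is a URL
		}
		// loc[5] is the index just past the delimiter character.
		if s := strings.TrimSpace(text[start:loc[5]]); s != "" {
			out = append(out, s)
		}
		start = loc[1] // resume after the trailing whitespace
	}
	// Keep whatever follows the last delimiter.
	if s := strings.TrimSpace(text[start:]); s != "" {
		out = append(out, s)
	}
	return out
}

func main() {
	text := "referred to in paragraph 3; comparison tool at " +
		"https://ec.europa.eu/ploteus/en/compare Adopted 7 procedures. processing is necessary"
	for _, s := range splitSentences(text) {
		fmt.Printf("%q\n", s)
	}
}
```

This keeps the URL inside its sentence, but it still splits after abbreviations such as "e. g.", which is the kind of case a real sentence tokenizer is meant to handle.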

    This answer was accepted by the asker as the best answer.
