donglian7879 2019-03-07 17:27

Regex to split at period | semicolon, but ignore URLs

I'm trying to parse and match a lot of legal text, splitting it all up into individual sentences. I have the following regex, which works fine for a few lines of simple text:

[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)

! and ? are pretty irrelevant here, but . and ; are quite common as separators in the texts I'm trying to work with. The problem is that the above regex only finds those delimiters when they are followed by a whitespace character. The following text, for example, would not be matched properly:

Member State law or pursuant to contract with a health professional and subject to the conditions and safeguards referred to in paragraph 3; processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.eu/ploteus/en/compare Adopted 7 comparable procedures (e. g. certifications/audits), and registered as required by the Member State. of quality and safety of health care and of medicinal products or medical devices, on the basis of Union or Member State law, which provides for suitable and specific measures to safeguard the rights and freedoms of the data subject, in particular professional secrecy; processing is...

In particular, the following entire section:

processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards comparison tool at https://ec.europa.

would not be matched at all.
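
For reference, here is a minimal sketch that reproduces the behaviour, assuming Go's standard regexp package. The pattern is the one above written on a single line, with the line break inside the character class spelled as \n. The sample string is a shortened stand-in for the legal text, and the helper name matchOriginal is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// originalRe is the pattern from the question on one line; the literal
// line break inside each character class is written as \n.
var originalRe = regexp.MustCompile(`[^\.\!\?\;\n]*[\.\!\?\;\n](\s+)`)

// matchOriginal (hypothetical helper) returns every substring that the
// original pattern matches in text.
func matchOriginal(text string) []string {
	return originalRe.FindAllString(text, -1)
}

func main() {
	// Shortened sample, not the full legal text.
	sample := "in particular professional secrecy; processing is necessary, see https://ec.europa.eu/ploteus/en/compare Adopted (e. g. audits)."
	for _, m := range matchOriginal(sample) {
		fmt.Printf("%q\n", m)
	}
}
```

On this sample the pattern returns three matches, and none of them contains the text from "processing" up to "europa.". The dots inside the URL are followed by a letter rather than whitespace, so no match can start anywhere in that stretch, and the engine resumes after the last dot of the host name.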

Any help in improving the above regex would be greatly appreciated!

Thanks

1 answer

  • dongshan4549 2019-03-08 17:55

    I think what you are looking for is called a sentence tokenizer. For Go, I can recommend one library, github.com/jdkato/prose; it should do the job like a charm.

    Personally, I have never used it. Good luck!
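
If adding a library is not an option, a rough workaround with only the standard library might look like the sketch below. Go's regexp package has no lookahead, so this version matches URLs and delimiters with a single alternation and simply skips the URL matches. The names tokenRe and splitSentences, the https?:// URL pattern, and the sample string are all assumptions made for this illustration, not something from the question.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRe matches either a URL (first alternative, no capture group) or a
// sentence delimiter followed by whitespace or end of input (group 1).
var tokenRe = regexp.MustCompile(`https?://\S+|([.;!?])(?:\s+|$)`)

// splitSentences (hypothetical helper) cuts text after '.', ';', '!' or '?'
// when the delimiter is followed by whitespace or the end of the text, and
// never cuts inside something that looks like a URL.
func splitSentences(text string) []string {
	var out []string
	start := 0
	for _, m := range tokenRe.FindAllStringSubmatchIndex(text, -1) {
		if m[2] < 0 {
			continue // the URL alternative matched: skip it without splitting
		}
		// m[3] is the offset just past the delimiter character.
		if s := strings.TrimSpace(text[start:m[3]]); s != "" {
			out = append(out, s)
		}
		start = m[1] // resume after the whitespace that follows the delimiter
	}
	if s := strings.TrimSpace(text[start:]); s != "" {
		out = append(out, s) // trailing text without a final delimiter
	}
	return out
}

func main() {
	sample := "professional secrecy; a comparison tool at https://ec.europa.eu/ploteus/en/compare is available. Done"
	for i, s := range splitSentences(sample) {
		fmt.Printf("%d: %s\n", i+1, s)
	}
}
```

This keeps the URL inside its sentence, but it is still only a heuristic: abbreviations such as "e. g." are split, and a URL that ends a sentence swallows the final period. Those cases are where a real sentence tokenizer is likely to do better.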

    This answer was accepted by the asker as the best answer.
