douyi6960 2012-11-18 21:33

Optimizing the regular expressions of a sentence sanitizer

This is a sentence sanitizer.

function sanitize_sentence($string) {
    $pats = array(
    '/([.!?]\s{2}),/',      # Abc.  ,Def
    '/\.+(,)/',             # ......,
    '/(!|\?)\1+/',          # abc!!!!!!!!, abc?????????
    '/\s+(,)/',             # abc   , def
    '/([a-zA-Z])\1\1/');    # greeeeeeen
    $fixed = preg_replace($pats,'$1',$string); # apply pats
    $fixed = preg_replace('/(?:(?<=\s)|^)[^a-z0-9]+(?:(?=\s)|$)/i', '',$fixed); # bad chunks
    $fixed = preg_replace( '/([!?,.])(\S)/', '$1 $2', $fixed); # spaces after punctuation, if it doesn't exist already
    $fixed = preg_replace( '/[^a-zA-Z0-9!?.]+$/', '.', $fixed); # end of string must end in period
    $fixed = preg_replace('/,(?!\s)/',', ',$fixed); # spaces after commas
    return $fixed;
}

This is the test sentence:

hello [[[[[[]]]]]] friend.....? how are you [}}}}}}

It should return:

hello friend.....? how are you

But instead it is returning:

hello friend. .. .. ? how are you.

So there are two problems that I can't find a way around:

  1. The run of periods is being split into ". .. .." for some reason. The periods should stay together as "....." next to the question mark.
  2. The string must end in a period if and only if at least one of these characters appears anywhere in the string: !?,. (if none of them is found in the string, that preg_replace should not be executed)

Examples for the second problem:

This sentence doesn't need an ending period because the mentioned characters are nowhere to be found

This other sentence, needs it! Why? Because it contains at least one of the mentioned characters

(of course, the ending period should only be placed if it doesn't exist yet)

Thanks for your help!

1 answer

  • dongping1689 2012-11-18 21:40

    Here is the answer to your first problem. The third-to-last replacement is the problem:

    $fixed = preg_replace( '/([!?,.])(\S)/', '$1 $2', $fixed); # spaces after punctuation, if it doesn't exist already
    

    It matches the first period with the character class and the second period as the non-space character, then inserts a space between them. Since matches cannot overlap, the next match starts at the third period, so the third and fourth periods are split the same way, and so on. This is probably best fixed with a negative lookahead, which inspects the next character without consuming it:

    $fixed = preg_replace( '/[!?,.](?![!?,.\s])/', '$0 ', $fixed);
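
    To make the difference concrete, here is a minimal sketch that runs both patterns on the same text. It assumes the input has already been through the earlier cleanup steps; the variable names are made up for the illustration.

    ```php
    <?php
    // Text as it might look after the earlier cleanup steps (assumed for this demo).
    $input = 'hello friend.....? how are you';

    // Original pattern: every match consumes two characters, so the run of
    // periods is taken in pairs and a space lands inside each pair.
    $original = preg_replace('/([!?,.])(\S)/', '$1 $2', $input);

    // Lookahead pattern: the next character is only inspected, never consumed.
    // A space is added only if it is neither punctuation nor whitespace.
    $revised = preg_replace('/[!?,.](?![!?,.\s])/', '$0 ', $input);

    // A comma glued to the next word still gets its space.
    $comma = preg_replace('/[!?,.](?![!?,.\s])/', '$0 ', 'one,two');

    // The lookahead also succeeds at the very end of the string,
    // so a final punctuation mark picks up a trailing space.
    $trailing = preg_replace('/[!?,.](?![!?,.\s])/', '$0 ', 'done.');

    echo $original, "\n";       // hello friend. .. .. ? how are you
    echo $revised, "\n";        // hello friend.....? how are you
    echo $comma, "\n";          // one, two
    echo "[$trailing]", "\n";   // [done. ]
    ```

    The last case is worth keeping in mind: unlike the original pattern, this one can leave a trailing space after a final punctuation mark, so the string should be trimmed afterwards.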
    

    Here is how you could go about your second requirement (replace the second-to-last preg_replace):

    $fixed = trim($fixed);
    $fixed = preg_replace( '/[!?.,].*(?<![.!?])$/', '$0.', $fixed);
    

    First we strip leading and trailing whitespace, so that whitespace handling is kept separate from the trailing period. Then the preg_replace looks for a punctuation character in the string and, if it finds one, matches everything from there to the end of the string. The replacement puts the match back in place and appends the period. Note the negative lookbehind: it asserts that the string does not already end with a sentence-ending punctuation character, in which case nothing is matched and nothing is appended.
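
    As a quick check, the two lines can be wrapped in a small function and run against the example sentences from the question. This is only a sketch; the name ensure_final_period is hypothetical and not part of the original code.

    ```php
    <?php
    // Hypothetical helper wrapping the trim() and preg_replace() calls above.
    function ensure_final_period($s) {
        $s = trim($s);
        // Match from the first punctuation character to the end of the string,
        // but only if the string does not already end in ., ! or ?
        return preg_replace('/[!?.,].*(?<![.!?])$/', '$0.', $s);
    }

    // No !?,. anywhere: the pattern cannot match, so nothing is appended.
    $plain = ensure_final_period("This sentence doesn't need an ending period because the mentioned characters are nowhere to be found");

    // Contains punctuation and ends in a letter: a period is appended.
    $needs = ensure_final_period('This other sentence, needs it! Why? Because it contains at least one of the mentioned characters');

    // Already ends in sentence-ending punctuation: left unchanged.
    $done = ensure_final_period('Already done, thanks!');

    echo $plain, "\n";
    echo $needs, "\n";
    echo $done, "\n";
    ```

    Note that a string ending in a comma would still get a period after the comma, because the lookbehind only excludes ., ! and ?.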

    Accepted by the asker as the best answer.
