douwo6738 2015-02-07 20:25

Using preg_match_all to group words of 3 or fewer characters with words of 4 or more characters

I am trying to group words of 4 or more characters with words of 3 or fewer characters using preg_match_all() in PHP. This is for a keyword search function where users can enter things like "An elephant", and I cannot have any results come back that match only "An".

Therefore, instead of splitting the keywords on spaces (e.g. "An", "elephant"), I need to attach each keyword of three or fewer characters to the next or previous keyword (e.g. "An elephant", "History of").

To accomplish this I am trying to use conditional subpatterns, but I am not sure I am on the right track.

Here's the best I've got so far:

(\s\S{1,3}\s*)?(?(1)\S+)

Yet I also seem to be matching a whole bunch of empty spaces. Can someone please point me in the right direction?
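
A minimal sketch of where the empty matches come from, run against the sample phrase from this question. The group is optional and the conditional (?(1)\S+) has no "else" branch, so wherever group 1 does not take part, the whole pattern succeeds without consuming anything. The variable names are illustrative only.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '~(\s\S{1,3}\s*)?(?(1)\S+)~';
$subject = 'History of elephants';

preg_match_all($pattern, $subject, $m);

// Separate the zero-length matches from the ones that consumed text.
$empty = array_filter($m[0], function ($s) { return $s === ''; });
$nonEmpty = array_values(array_filter($m[0], function ($s) { return $s !== ''; }));

printf("%d empty matches\n", count($empty));
var_dump($nonEmpty);
```

On this input the only non-empty match should be " of elephants" (leading space included), which is not the "History of" / "elephants" split being asked for.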

In the case of "History of elephants" I am trying to get it to create two matches: "History of", and "elephants".

I cannot simply omit the "stop words" because they are important in this case. The real-life use case is searching for course titles such as "Calculus A" and in that case "A" is important.


  • dqq48152418 2015-02-07 22:23

    See if this would match your needs:

    \b(?:[\w'-]{1,3}\W+[\w'-]{4,}|[\w'-]{4,}\W+[\w'-]{1,3}|[\w'-]{4,})\b
    
    • \b anchors the match at a word boundary.
    • [\w'-]{1,3}\W+[\w'-]{4,} matches a short word of 1-3 characters (word characters, apostrophe or hyphen), followed by \W+ (one or more non-word characters), followed by a long word of 4 or more characters.
    • |[\w'-]{4,}\W+[\w'-]{1,3} otherwise matches a long word followed by a short one.
    • |[\w'-]{4,} otherwise matches a lone word of at least 4 characters (lower the minimum if needed).

    Test at regex101.com; Regex FAQ

    Also note the problem with input such as "I visited Calculus A, you in Calculus B?". It outputs: I visited, Calculus A, in Calculus. A short word is paired with the word that follows it first, so "in" claims "Calculus", and "you" and "B" end up in no match at all.
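
The caveat can be reproduced with a short sketch. It uses the same pattern on a single line; the variable names and the leftover-word check are illustrative additions, not part of the original answer.

```php
<?php
$pattern = '~\b(?:[\w\'-]{1,3}\W+[\w\'-]{4,}|[\w\'-]{4,}\W+[\w\'-]{1,3}|[\w\'-]{4,})\b~';
$input = 'I visited Calculus A, you in Calculus B?';

preg_match_all($pattern, $input, $out);
$pairs = $out[0];

// Rough check for words that ended up in no match at all.
// array_diff compares by value, so repeated words are not counted separately.
$splitter = '~[^\w\'-]+~';
$allWords = preg_split($splitter, $input, -1, PREG_SPLIT_NO_EMPTY);
$matchedWords = preg_split($splitter, implode(' ', $pairs), -1, PREG_SPLIT_NO_EMPTY);
$dropped = array_values(array_diff($allWords, $matchedWords));

print_r($pairs);
print_r($dropped);
```

If dropped words matter for the search, they would need separate handling, for example by attaching them to a neighbouring match after the fact.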


    And a PHP example ($out[0] holds the full matches):

    $str = "
    An elephant in the garden 
    history of elephants
    Algebra A B-movies";
    
    // The x flag lets the pattern span several lines:
    // whitespace outside character classes is ignored.
    $pattern = '~\b(?:
    [\w\'-]{1,3}\W+[\w\'-]{4,}|
    [\w\'-]{4,}\W+[\w\'-]{1,3}|
    [\w\'-]{4,}
    )\b~x';
    
    if (preg_match_all($pattern, $str, $out)) {
      print_r($out[0]);
    }
    

    outputs to:

    Array
    (
        [0] => An elephant
        [1] => the garden
        [2] => history of
        [3] => elephants
        [4] => Algebra A
        [5] => B-movies
    )
    

    Test at eval.in (link expires soon)
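
For the search use case in the question, the pattern could be wrapped in a small helper. This is only a sketch: the function name groupKeywords is hypothetical, and the # comments inside the pattern rely on the x flag.

```php
<?php
// Hypothetical helper: returns the grouped keywords for one search query.
function groupKeywords(string $query): array
{
    $pattern = '~\b(?:
        [\w\'-]{1,3}\W+[\w\'-]{4,}   # short word, then long word
      | [\w\'-]{4,}\W+[\w\'-]{1,3}   # long word, then short word
      | [\w\'-]{4,}                  # long word on its own
    )\b~x';

    preg_match_all($pattern, $query, $out);

    return $out[0];
}

print_r(groupKeywords('History of elephants'));
print_r(groupKeywords('Calculus A'));
```

A query made up only of short words, such as "A B", yields an empty array with this pattern, so callers may want a fallback for that case.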

    This answer was selected as the best answer by the asker.