dounouxi1020 2012-12-12 12:31

Irregular RegEx behavior

I have a string:

$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

And I have a regular expression:

$data = preg_split("/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/", $day);

The expected $data array should be:

array
      0 => string '11.08.2012 ' (length=11)
      1 => string 'PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=62)
      2 => string 'Y AMS-AMS 13:15-19:15' (length=21)

But my result is:

array
      0 => string '11.08.2012 ' (length=11)
      1 => string 'P' (length=1)
      2 => string 'R' (length=1)
      3 => string 'O' (length=1)
      4 => string 'C BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=59)
      5 => string 'Y AMS-AMS 13:15-19:15' (length=21)

I cannot work out what's happening here. Could someone please explain?


  • douzai2562 2012-12-12 12:40

    In short, the problem is that the (?=...) subexpression in your pattern matches a position, not characters. I understand that was exactly your intention; the problem is that the next match attempt does not start where the text described inside (?=...) ends, but just one character after the position matched by the lookahead.

    Let's look at this process in detail. The first time the split is attempted, the engine walks the string until it gets to the position marked by the asterisk:

    11.08.2012 *PROC BRE-AMS 08:00-12:00
    

    ... where it can match the pattern given. For the next attempt, the starting position 'bumps along' one symbol, so now we're here:

    11.08.2012 P*ROC BRE-AMS 08:00-12:00
    

    ... and voilà, the pattern matches again, because the {1,4} quantifier is just as happy with the shorter ROC. The same thing happens before O and before C, and that's how you got these 'irregular' P, R and O pieces.
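The bump-along behaviour described above can be observed directly by listing every position at which the lookahead succeeds. This is only an illustrative sketch: it uses the pattern from the question with the redundant brackets dropped, and `$offsets` and `$parts` are names introduced here for the demonstration.

```php
<?php
// The string from the question.
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// The original lookahead-only pattern (brackets around single symbols removed).
$pattern = '/(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';

// PREG_OFFSET_CAPTURE attaches the byte offset to every (empty) match.
preg_match_all($pattern, $day, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

foreach ($offsets as $offset) {
    // Print each matched position followed by the text that starts there.
    printf("%2d: %s\n", $offset, substr($day, $offset, 12));
}

// preg_split() cuts the string at every one of those positions.
$parts = preg_split($pattern, $day);
print_r($parts);
```

The listing should show four adjacent positions inside PROC (before P, R, O and C) plus the one before Y, which is where the extra one-letter pieces in the split result come from.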


    So much for the explanation; now for the "how to fix" part. The easiest way out of this, I suppose, is adding this little twist to your split pattern:

    $data = preg_split('/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
    

    We still match a position, but now this position must be a word boundary, i.e. one that separates a 'word' symbol from a non-word one. The same idea can be expressed with a negative lookbehind:

    $data = preg_split('/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
    

    ... which is actually more precise, but less elegant, I suppose.
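As a quick check, both patterns can be run against the string from the question. The second half of this sketch shows one possible reading of the "more precise" remark: `\b` treats digits and underscores as word symbols, while the lookbehind only looks at uppercase letters. The `$glued` input is a made-up example, not data from the question.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

$boundary   = '/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';
$lookbehind = '/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';

// On the original string both patterns split at the same two positions.
$withBoundary   = preg_split($boundary, $day);
$withLookbehind = preg_split($lookbehind, $day);
print_r($withBoundary);
var_dump($withBoundary === $withLookbehind);

// Hypothetical input where a digit is glued to the code letter.
// There is no word boundary between "1" and "Y", so \b finds nothing to split on,
// but "1" is not an uppercase letter, so the lookbehind version still splits.
$glued = "1Y AMS-AMS 13:15";
$gluedBoundary   = preg_split($boundary, $glued);
$gluedLookbehind = preg_split($lookbehind, $glued);
print_r($gluedBoundary);
print_r($gluedLookbehind);
```

Which of the two behaviours is wanted depends on what the real data looks like; for strings shaped like the one in the question, the two patterns give the same three pieces.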

    Two sidenotes here: 1) don't use character class syntax when you need to specify a single symbol, whether it's a simple one, like -, or a 'shortcut' one, like \s; 2) use single quotation marks to delimit your pattern unless you want to interpolate some variables in it.
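Both sidenotes can be checked with a few lines. This is a small sketch: `$verbose`, `$concise` and `$suffix` are names made up for the illustration.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";

// 1) Brackets around a single symbol add nothing: both patterns describe the same thing.
$verbose = '/\b(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/';
$concise = '/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';
$sameResult = preg_split($verbose, $day) === preg_split($concise, $day);
var_dump($sameResult);

// 2) Only double-quoted strings interpolate variables.
$suffix = 'AMS';          // hypothetical variable, for illustration only
$single = '/$suffix/';    // stays the literal text /$suffix/
$double = "/$suffix/";    // $suffix is replaced by its value
var_dump($single, $double);
```

With single quotes a stray `$` in a pattern cannot be mistaken for a variable, which is the main reason to prefer them for regular expressions.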

    This answer was selected as the best answer by the asker.