doushan6692 2014-09-26 15:49
Accepted

PHP regex parsing: splitting tokens for my own language. Is there a better way?

I am creating my own language.

The goal is to "compile" it to PHP or JavaScript and, ultimately, to interpret and run it in the language itself, so that it looks like a "middle-level" language.

Right now, I'm focusing on interpreting it in PHP and running it.

At the moment, I'm using regex to split the string and extract the multiple tokens.

This is the regex I have:

/\:((?:cons@(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:@[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g

This is quite hard to read and maintain, even though it works.

Is there a better way of doing this?

Here is an example of the code for my language:

:define:&0:factorial
    :param:~0:static
    :case
        :lower@equal:cons@1
    :case:end
    :scope
        :return:cons@1
    :scope:end
    :scope
        :define:~0:static
        :define:~1:static
        :require:static
        :call:static@sub:^~0:~1 :store:~0
        :call:&-1:~0 :store:~1
        :call:static@sum:^~0:~1 :store:~0
        :return:~0
    :scope:end
:define:end

This defines a recursive function to calculate the factorial (it is not well written, but that isn't important).

The goal is to get what comes after each :, including the @ part. For example, :static@sub is a single token, stored without the leading :.

Every token has the same form, except for :cons, which can take a value after it. The value is numeric (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with " and supports escapes such as \". Multi-line strings aren't supported.

Variables are written like ~0; a leading ^ refers to the value in the :scope above.

Functions are similar but use &0 instead, and &-1 points to the current function (no need for ^&-1 here).

That said, is there a better way to get the tokens?

Here you can see it in action: http://regex101.com/r/nF7oF9/2

1 answer

  • douke6881 2014-09-26 16:23

    [Update] To address the pattern's complexity and maintainability, you can split it across several lines using PCRE_EXTENDED (the x modifier) and add comments:

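    // A nowdoc keeps every backslash literal, so the pattern reads exactly as written.
    // PHP has no g modifier: preg_match_all() collects all matches instead.
    $pattern = <<<'REGEX'
    /
      # read constant (?)
      \:((?:cons@(?:\d+(?:\.\d+)?|
      # read a string (?)
      (?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
      # read an identifier (?)
      (?:[a-z]+(?:@[a-z]+)?|
      # read whatever
      \^?[\~\&](?:[a-z]+|\d+|\-1)))
    /x
    REGEX;
    preg_match_all($pattern, $input, $matches);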
    

    Beware that in extended mode all literal whitespace in the pattern is ignored, except in certain contexts such as inside a character class (an escape like \s is normally "safe").
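
    A quick way to check that whitespace rule is the sketch below. The sample subjects are made up for illustration; only preg_match and the x modifier come from the text above.

```php
<?php
// Under the x modifier, a literal space in the pattern is dropped,
// so this pattern is really /^ab$/.
$ignored = preg_match('/^a b$/x', 'ab');
$literal = preg_match('/^a b$/x', 'a b');

// \s, an escaped space, and a space inside a character class still count.
$escapeS = preg_match('/^a\sb$/x', 'a b');
$escaped = preg_match('/^a\ b$/x', 'a b');
$inClass = preg_match('/^a[ ]b$/x', 'a b');

var_dump($ignored, $literal, $escapeS, $escaped, $inClass);
```

    Only the second call should fail to match, because the space between a and b never reaches the regex engine.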


    Now, if you want to improve your lexer and parser, read on:

    What (f)lex (the free equivalent of LEX) does is simply let you pass a list of regexps, each optionally tied to a "group". You can also try ANTLR and PHP Target Runtime to get the work done.

    As for your request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:

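    // Illustrative patterns (not from the original answer); each one is anchored
    // with ^ so that it can only match at the start of the remaining input.
    $regexp = [
      '/^"(?:\\\\.|[^"\\\\])*"/' => 'STRING',
      '/^[a-z]+/'                => 'ID',
      '/^\s+/'                   => 'WS',
    ];
    $input = 'foo "bar"'; // sample input, for illustration only
    $tokens = [];
    while ($input !== '') {
      $best = null;
      $k = null;
      foreach ($regexp as $re => $kind) {
        // The first regexp that matches wins; empty matches are skipped
        // because they would never consume input.
        if (preg_match($re, $input, $match) && $match[0] !== '') {
          $best = $match[0];
          $k = $kind;
          break;
        }
      }

      if (null === $best) {
        throw new Exception("could not analyze input, invalid token");
      }

      $tokens[] = ['kind' => $k, 'value' => $best];

      $input = (string) substr($input, strlen($best)); // move.
    }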
    

    Since FLEX and Yacc/Bison integrate with each other, the usual pattern is to read only up to the next token (that is, they don't loop over the whole input before parsing starts).
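
    In PHP, one way to get that "next token on demand" behaviour is a generator. The following is only a sketch under assumptions: the function name lex, the rule table and the sample input are hypothetical, and the matching loop is the same naïve one as above.

```php
<?php
// Hypothetical token kinds and patterns, chosen only for this illustration.
$rules = [
    'NUMBER' => '/^\d+/',
    'ID'     => '/^[a-z]+/',
    'WS'     => '/^\s+/',
];

// Yields one token at a time, so a parser can pull tokens as it needs them
// instead of waiting for the whole input to be tokenized.
function lex(string $input, array $rules): Generator
{
    while ($input !== '') {
        foreach ($rules as $kind => $re) {
            if (preg_match($re, $input, $m) && $m[0] !== '') {
                $input = (string) substr($input, strlen($m[0]));
                yield ['kind' => $kind, 'value' => $m[0]];
                continue 2; // back to the while loop for the next token
            }
        }
        throw new Exception('could not analyze input, invalid token');
    }
}

$tokens = lex('abc 42', $rules);
$first = $tokens->current();  // runs the lexer only up to the first token
$tokens->next();
$second = $tokens->current(); // then up to the second one
```

    The rest of the input is not examined until the consumer asks for more, which is roughly the relationship between a Yacc/Bison parser and its lexer.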

    The $regexp array can have any shape. I expected it to be a "regexp" => "kind" key/value map, but you can also use an array like this:

    $regexp = [['reg' => '...', 'kind' => STRING], ...]
    

    You can also enable or disable regexps using groups (the way FLEX groups, which it calls start conditions, work). For example, consider the following code:

    class Foobar {
      const FOOBAR = "arg";
      function x() {...}  
    }
    

    There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). Likewise, there is no need to activate the class identifier regexp when you are already inside a class.

    FLEX groups also make it possible to read comments: a first regexp activates a group in which the other regexps are ignored until some closing match occurs (like "*/").
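
    The comment case can be sketched in PHP by keeping one rule list per group and letting a rule switch the active group. Everything here is an assumption made for illustration: the group names INITIAL and COMMENT, the rule layout, and the function name lexWithGroups are not taken from FLEX or from my old lexer.

```php
<?php
// Hypothetical rule table: group => list of [pattern, token kind or null, next group or null].
// A null kind means "consume but emit nothing"; a null next group means "stay in this group".
$rules = [
    'INITIAL' => [
        ['~^/\*~',    null, 'COMMENT'], // "/*" activates the COMMENT group
        ['~^[a-z]+~', 'ID', null],
        ['~^\s+~',    null, null],      // skip whitespace
    ],
    'COMMENT' => [
        ['~^\*/~',    null, 'INITIAL'], // "*/" switches back
        ['~^.~s',     null, null],      // ignore anything else, one character at a time
    ],
];

function lexWithGroups(string $input, array $rules): array
{
    $group = 'INITIAL';
    $tokens = [];
    while ($input !== '') {
        $matched = false;
        // Only the regexps of the active group are tried.
        foreach ($rules[$group] as [$re, $kind, $next]) {
            if (preg_match($re, $input, $m) && $m[0] !== '') {
                if ($kind !== null) {
                    $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                }
                $input = (string) substr($input, strlen($m[0]));
                $group = $next ?? $group;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception('could not analyze input, invalid token');
        }
    }
    return $tokens;
}

$tokens = lexWithGroups('foo /* not an id */ bar', $rules);
```

    While the COMMENT group is active the ID regexp is never tried, so the words inside the comment do not become tokens.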

    Note that this is a naïve approach: a lexer generator like FLEX actually generates an automaton, which uses different states to represent your needs (a regexp is itself an automaton).

    It uses an algorithm based on packed indexes or something similar, which is efficient in both memory and speed (I used the naïve "for each" because I did not understand the algorithm well enough).
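
    A middle ground that stays in plain PHP is to merge all the regexps into one alternation with named groups, so that PCRE does a single match per token instead of one preg_match per rule. To be clear, this is a different technique from the automaton FLEX generates, and the rule names, the function name lexCombined and the sample input are assumptions made for this sketch.

```php
<?php
// Hypothetical token kinds and patterns, for illustration only.
$rules = [
    'NUMBER' => '\d+(?:\.\d+)?',
    'ID'     => '[a-z]+',
    'WS'     => '\s+',
];

// Build a single pattern: \G(?:(?<NUMBER>...)|(?<ID>...)|(?<WS>...))
// \G anchors the match at the offset passed to preg_match().
$parts = [];
foreach ($rules as $kind => $re) {
    $parts[] = "(?<$kind>$re)";
}
$pattern = '/\G(?:' . implode('|', $parts) . ')/';

function lexCombined(string $input, string $pattern, array $kinds): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $ok = preg_match($pattern, $input, $m, PREG_UNMATCHED_AS_NULL, $offset);
        if (!$ok || $m[0] === '') {
            throw new Exception("invalid token at offset $offset");
        }
        // The named group that is set tells us which rule matched.
        foreach ($kinds as $kind) {
            if (isset($m[$kind])) {
                $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

$tokens = lexCombined('abc 4.5', $pattern, array_keys($rules));
```

    Rule order still decides ties, since the alternation is tried left to right, and PREG_UNMATCHED_AS_NULL requires PHP 7.2 or later.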

    As I said, it was something I made in the past - something like 6/7 years ago.

    • It was on Windows.
    • It was not particularly quick (well, it is O(N²) because of the two loops).
    • I also think that PHP was compiling the regexp each time. Now that I do Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already compiled regexp.
    • I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
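
    One detail about the offset variant is worth a sketch: with an offset, ^ no longer means "here", so the pattern has to be anchored with \G or the A modifier instead. The sample input and the offset value below are made up for illustration.

```php
<?php
$input = 'abc 42';
$offset = 4; // hypothetical position of the next token

// ^ still means "start of the subject", so it cannot match at offset 4.
$withCaret = preg_match('/^\d+/', $input, $m1, 0, $offset);

// \G means "the position where matching started", that is, $offset.
$withG = preg_match('/\G\d+/', $input, $m2, 0, $offset);

// The A (anchored) modifier has the same effect without touching the pattern body.
$withA = preg_match('/\d+/A', $input, $m3, 0, $offset);

// Without any anchor, the engine silently skips ahead to the first match,
// which would hide invalid input in a lexer.
$unanchored = preg_match('/\d+/', $input, $m4, 0, 0);
```

    Here only the ^ version fails; the unanchored call succeeds even from offset 0 by jumping over "abc ", which is exactly what a lexer must not do.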

    You should try the ANTLR3 PHP Code Generation Target: the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)

    This answer was selected by the asker as the best answer.