PHP regex parsing - splitting tokens for my own language. Is there a better way?

I am creating my own language.

The goal is to "compile" it to PHP or JavaScript and, ultimately, to interpret and run it in the language itself, so that it looks like a "middle-level" language.

Right now, I'm focusing on interpreting it in PHP and running it.

At the moment, I'm using regex to split the string and extract the multiple tokens.

This is the regex I have:

/\:((?:cons@(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:@[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g

This is quite hard to read and maintain, even though it works.

Is there a better way of doing this?

Here is an example of the code for my language:

:define:&0:factorial
    :param:~0:static
    :case
        :lower@equal:cons@1
    :case:end
    :scope
        :return:cons@1
    :scope:end
    :scope
        :define:~0:static
        :define:~1:static
        :require:static
        :call:static@sub:^~0:~1 :store:~0
        :call:&-1:~0 :store:~1
        :call:static@sum:^~0:~1 :store:~0
        :return:~0
    :scope:end
:define:end

This defines a recursive function to calculate the factorial (it isn't well written, but that isn't important).

The goal is to capture whatever follows each :, including any @ part. For example, :static@sub is a single token, stored without the leading : (static@sub).

Every token follows the same rule, except for :cons, which can take a value after it. The value is a number (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with " and supports escaping like \". Multi-line strings aren't supported.

Variables are written like ~0; prefixing one with ^ refers to the value in the :scope above.

Functions are similar but use & instead (for example &0); &-1 refers to the current function (no need for ^&-1 here).
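To make the expected output concrete, here is a minimal sketch of how the pattern above could be applied in PHP. It assumes preg_match_all() in place of the /g flag (which PHP does not have) and assumes the line-break alternative in the string part is \r\n|\r|\n. The helper name extractTokens is hypothetical.

```php
<?php
// The pattern from the question, kept in a nowdoc so no PHP-level escaping applies.
// Assumption: the line-break alternative is \r\n|\r|\n.
$pattern = <<<'RE'
/\:((?:cons@(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:@[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/
RE;

// Hypothetical helper: returns the tokens of one source line without the leading ':'.
function extractTokens(string $pattern, string $line): array
{
    // preg_match_all() collects every match, which replaces the /g flag.
    preg_match_all($pattern, $line, $matches);
    return $matches[1]; // capture group 1 holds the text after each ':'
}

print_r(extractTokens($pattern, ':call:static@sub:^~0:~1 :store:~0'));
```

For that line, the tokens should come out as call, static@sub, ^~0, ~1, store and ~0.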

That said, is there a better way to get the tokens?

Here you can see it in action: http://regex101.com/r/nF7oF9/2

1 answer

douke6881 2014-09-26 16:23
[Update] To address the complexity and maintainability of the pattern, you can split it across several lines using PCRE_EXTENDED (the x modifier) and add comments:

    // PHP has no g modifier, so preg_match_all() is used to collect every match.
    preg_match_all('/
        \:(
            # read a constant (?)
            (?:cons@(?:\d+(?:\.\d+)?|
                # read a string (?)
                (?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")
            ))|
            # read an identifier (?)
            (?:[a-z]+(?:@[a-z]+)?|
                # read whatever
                \^?[\~\&](?:[a-z]+|\d+|\-1)
            )
        )
    /x', $input, $matches);

Beware that in this mode all whitespace in the pattern is ignored, except under certain conditions (an escaped space, or a space inside a character class, is normally "safe").
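A quick way to see the whitespace rule in practice is the small sketch below. The patterns and variable names are made up for illustration only.

```php
<?php
// In extended mode (x), an unescaped space in the pattern is skipped,
// so '/a b/x' behaves like '/ab/'.
$ignored = preg_match('/a b/x', 'ab');      // 1: the pattern space is ignored
$literal = preg_match('/a b/x', 'a b');     // 0: the subject contains no "ab"

// An escaped space or a space inside a character class stays significant.
$escaped = preg_match('/a\ b/x', 'a b');    // 1
$inClass = preg_match('/a[ ]b/x', 'a b');   // 1

var_dump($ignored, $literal, $escaped, $inClass);
```

So any space that is part of the language being matched has to be written as "\ ", "[ ]" or \s once the x modifier is on.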

Now, if you want to improve your lexer and parser, read on:

What (f)lex [a free equivalent of LEX] does is simply let you pass a list of regexps and, optionally, a "group" for each. You can also try ANTLR and its PHP Target Runtime to get the work done.

As for your request, I wrote a lexer in the past following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:

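    // reg1..reg3 stand for patterns anchored at the start of the input (e.g. with ^);
    // STRING, ID and WS stand for token kinds.
    $regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
    $input = ...;
    $tokens = [];
    while ($input !== '') {
        $best = null;
        $k = null;
        // Try each pattern in order; the first one that matches wins.
        foreach ($regexp as $re => $kind) {
            if (preg_match($re, $input, $match)) {
                $best = $match[0];
                $k = $kind;
                break;
            }
        }
        // An empty match would never advance, so treat it as an error too.
        if (null === $best || '' === $best) {
            throw new Exception("could not analyze input, invalid token");
        }
        $tokens[] = ['kind' => $k, 'value' => $best];
        $input = substr($input, strlen($best)); // move past the token.
    }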
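Applied to the language from the question, the loop could look roughly like the sketch below. The rule names (CONS, REF, NAME, WS), the constant RULES and the function tokenize are assumptions made for this example, and the string rule is deliberately simplified: it does not handle escaped quotes.

```php
<?php
// Ordered rules; each pattern is anchored with ^ so it only matches at the
// current position. CONS comes before NAME so ":cons@1" is not cut at ":cons".
const RULES = [
    'CONS' => '/^:cons@(?:\d+(?:\.\d+)?|"[^"\r\n]*")/', // simplified: no \" escapes
    'REF'  => '/^:\^?[~&](?:-1|\d+)/',
    'NAME' => '/^:[a-z]+(?:@[a-z]+)?/',
    'WS'   => '/^\s+/',
];

function tokenize(string $input): array
{
    $tokens = [];
    while ($input !== '') {
        $found = false;
        foreach (RULES as $kind => $re) {
            if (preg_match($re, $input, $match) === 1 && $match[0] !== '') {
                if ($kind !== 'WS') { // whitespace is consumed but not reported
                    $tokens[] = ['kind' => $kind, 'value' => ltrim($match[0], ':')];
                }
                $input = substr($input, strlen($match[0]));
                $found = true;
                break;
            }
        }
        if (!$found) {
            throw new Exception('invalid token near: ' . substr($input, 0, 10));
        }
    }
    return $tokens;
}

print_r(tokenize(':call:&-1:~0 :store:~1'));
```

Each rule stays short enough to read on its own, and adding a token kind means adding one line to the rule table instead of editing one large alternation.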

Since FLEX and Yacc/Bison integrate with each other, the usual pattern is to read only up to the next token (that is, they don't loop over the whole input before parsing).

The $regexp array can take any shape. I expected it to be a "regexp" => "kind" key/value map, but you can also use an array like this:

$regexp = [['reg' => '...', 'kind' => STRING], ...]

You can also enable or disable regexps using groups (the way FLEX groups work). For example, consider the following code:

    class Foobar {
        const FOOBAR = "arg";
        function x() {...}
    }

There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). And there is no need to activate the class identifier regexp when you are already inside a class.

FLEX's groups make it possible to read comments: a first regexp activates a group that ignores the other regexps until some match is found (like "*/").
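The comment case can be sketched with one rule table per group. This is only an illustration of the idea, not how FLEX itself is implemented: the state names, the STATES constant and the function lexWithStates are hypothetical, and the C-style comments are not part of the language in the question.

```php
<?php
// One rule list per state. Each rule is [pattern, token kind or null to skip,
// next state or null to stay]. "~" is the delimiter so "/" needs no escaping.
const STATES = [
    'default' => [
        ['~^/\*~',    null, 'comment'], // "/*" switches to the comment group
        ['~^[a-z]+~', 'ID', null],
        ['~^\s+~',    null, null],
    ],
    'comment' => [
        ['~^\*/~',          null, 'default'], // "*/" switches back
        ['~^(?:[^*]+|\*)~', null, null],      // swallow everything else
    ],
];

function lexWithStates(string $input): array
{
    $state = 'default';
    $tokens = [];
    while ($input !== '') {
        $matched = false;
        // Only the rules of the active state are tried.
        foreach (STATES[$state] as [$re, $kind, $next]) {
            if (preg_match($re, $input, $m) === 1 && $m[0] !== '') {
                if ($kind !== null) {
                    $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                }
                $input = substr($input, strlen($m[0]));
                $state = $next ?? $state;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("invalid token in state $state");
        }
    }
    return $tokens;
}

print_r(lexWithStates('foo /* bar * baz */ qux'));
```

While the comment state is active, the ID rule is never tried, so the words inside the comment do not become tokens.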

Note that this is a naïve approach: a lexer like FLEX actually generates an automaton, which uses different states to represent your needs (a regexp is itself an automaton).

It uses an algorithm based on packed indexes or something similar, which is efficient in both memory and speed (I used the naïve "for each" because I did not understand the algorithm well enough).

As I said, it was something I wrote in the past, about 6 or 7 years ago.

It was on Windows.

It was not particularly quick (it is O(N²) because of the two loops).

I also think PHP was compiling the regexp each time. Now that I work in Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already compiled regexp.

I was using preg_match with an offset to avoid doing the substr($input, ...) at the end.
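A minimal sketch of the offset variant follows, assuming the same kind of ordered rule table as in the loop above. The names OFFSET_RULES and tokenizeWithOffset are hypothetical and the string rule is simplified. Note that ^ still refers to the start of the whole subject when an offset is passed, so the patterns are anchored with \G instead.

```php
<?php
// \G anchors a match at the offset given to preg_match(); ^ would not,
// because it always refers to the start of the whole subject string.
const OFFSET_RULES = [
    'CONS' => '/\G:cons@(?:\d+(?:\.\d+)?|"[^"\r\n]*")/', // simplified: no \" escapes
    'REF'  => '/\G:\^?[~&](?:-1|\d+)/',
    'NAME' => '/\G:[a-z]+(?:@[a-z]+)?/',
    'WS'   => '/\G\s+/',
];

function tokenizeWithOffset(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $found = false;
        foreach (OFFSET_RULES as $kind => $re) {
            // The fifth argument is the byte offset at which matching starts.
            if (preg_match($re, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                if ($kind !== 'WS') {
                    $tokens[] = ['kind' => $kind, 'value' => ltrim($m[0], ':')];
                }
                $offset += strlen($m[0]); // advance without copying the input
                $found = true;
                break;
            }
        }
        if (!$found) {
            throw new Exception("invalid token at offset $offset");
        }
    }
    return $tokens;
}

print_r(tokenizeWithOffset(':return:cons@1'));
```

The input string is never copied, which avoids the repeated substr() calls on long inputs.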

You should try the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)