duanfei1268 2016-06-25 00:25

Proper parsing strategy using PLY.yacc

I am writing a PHP parser in PLY in order to teach myself the concepts of lexing/parsing.

I have the lexer tokens created for a very simple PHP code snippet but I am stuck on the proper way to parse.

Here is the code snippet I am trying to lex/parse:

  <?php if (isset($_REQUEST['name'])){
        $name = $_REQUEST['name'];
        $msg = "Hello, " . $name . "!";
        $encoded = htmlspecialchars($msg);
  }
  ?>

My goal is to trace the user input and determine whether it has indeed reached the htmlspecialchars() function. My current parsing strategy gets me as far as line 2:

$name = $_REQUEST['name'];

but I have no idea what the proper way to parse line 3 is:

$msg = "Hello, " . $name . "!";

The complication is that I will never know in advance how many concatenations will be applied to my user input, and I feel it is wrong to "hard code" a rule just to successfully parse the example code. For example, on this line I am interested in the fact that the $msg variable includes my user-supplied data (from the $name variable).

I have tried parsing this statement in probably the worst possible way, just to test whether I could reach it, but when I run my script it says "WARNING: Symbol 'wrong' is unreachable":

def p_wrong(p):
    '''wrong : VARIABLE EQUALS QUOTED_ENCAPSED_STRING DOT VARIABLE DOT QUOTED_ENCAPSED_STRING SEMICOLON'''
    print "wrong"

So I am hoping for guidance on how to parse line 3 in such a way that it won't matter how many concatenations or other operations are applied to the variables I am tracing. I have a feeling this is where a lesson on BNF grammar, or the wonderfully painful complexities of parsing, will begin. I want to learn; I just don't know where to start.
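One common way to accept any number of concatenations is a recursive grammar rule instead of a fixed token sequence. The following is only a minimal sketch under assumptions, not a definitive PHP grammar: the token name STRING, the rule names statements, statement, expr and term, and the choice to return the set of variables read on the right-hand side are all made up for illustration, and the string token ignores escape sequences.

```python
import ply.lex as lex
import ply.yacc as yacc

# Reduced, hypothetical token set: just enough for assignments with "." concatenation.
tokens = ("VARIABLE", "STRING", "DOT", "EQUALS", "SEMICOLON")

t_ignore = " \t\n"
t_DOT = r"\."
t_EQUALS = r"="
t_SEMICOLON = r";"

def t_VARIABLE(t):
    r"\$[A-Za-z_]\w*"
    return t

def t_STRING(t):
    r'"[^"]*"'
    # Simplification: double-quoted strings only, no escape handling.
    return t

def t_error(t):
    raise SyntaxError("unknown character %r" % t.value[0])

# The first rule defined is the start symbol, so every other rule
# has to be reachable from it.
def p_statements(p):
    """statements : statements statement
                  | statement"""
    if len(p) == 3:
        p[0] = p[1] + [p[2]]
    else:
        p[0] = [p[1]]

def p_statement(p):
    "statement : VARIABLE EQUALS expr SEMICOLON"
    p[0] = (p[1], p[3])  # (assigned variable, set of variables read)

def p_expr_concat(p):
    "expr : expr DOT term"
    # Left recursion: an expr followed by one more ". term", repeated as often as needed.
    p[0] = p[1] | p[3]

def p_expr_term(p):
    "expr : term"
    p[0] = p[1]

def p_term_variable(p):
    "term : VARIABLE"
    p[0] = {p[1]}

def p_term_string(p):
    "term : STRING"
    p[0] = set()  # a literal contributes no variables

def p_error(p):
    raise SyntaxError("syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc(debug=False, write_tables=False)

result = parser.parse('$msg = "Hello, " . $name . "!";', lexer=lexer)
print(result)
```

Because expr refers to itself, the same three expression rules cover one concatenation or twenty, and the value carried in p[0] records which variables feed each assignment.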

Here is my complete code at this point:

import ply.lex as lex
import ply.yacc as yacc

string = """<?php if (isset($_REQUEST['name'])){
               $name = $_REQUEST['name'];
               $msg = "Hello, " . $name . "!";
               $encoded = htmlspecialchars($msg);
}
?>"""

delimeters = ('LPAREN', 'RPAREN', 'LBRACKET', 'RBRACKET')

tokens = delimeters + (
    "CHAR",
    "NUM",
    "OPEN_TAG",
    "CLOSE_TAG",
    "VARIABLE",
    "CONSTANT_ENCAPSED_STRING",
    "ENCAPSED_AND_WHITESPACE",
    "QUOTED_ENCAPSED_STRING",
    "LCURLYBRACKET",
    "RCURLYBRACKET",
    "EQUALS",
    "SEMICOLON",
    "QUOTE",
    "DOT",
    "IF"
)

t_ignore         = " \t"
t_CHAR           = r"[a-z]"
t_LPAREN         = r'\('
t_RPAREN         = r'\)'
t_RBRACKET       = r'\]'
t_LBRACKET       = r'\['
t_RCURLYBRACKET  = r'\}'
t_LCURLYBRACKET  = r'\{'
t_EQUALS         = r'='
t_SEMICOLON      = r';'
t_DOT            = r'\.'


def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")

def t_CONSTANT_ENCAPSED_STRING(t):
    r"'([^\\']|\\(.|\n))*'"
    t.lexer.lineno += t.value.count("\n")
    return t

def t_QUOTED_ENCAPSED_STRING(t):
    r"""\"([^\\"]|\\(.|\n))*\""""
    t.lexer.lineno += t.value.count("\n")
    return t

def t_OPEN_TAG(t):
    r'<[?%]((php[ \t\r\n]?)|=)?'
    if '=' in t.value: t.type = 'OPEN_TAG_WITH_ECHO'
    t.lexer.lineno += t.value.count("\n")
    return t

def t_CLOSE_TAG(t):
    r'[?%]>\r?\n?'
    t.lexer.lineno += t.value.count("\n")
    #t.lexer.begin('INITIAL')
    return t

def t_VARIABLE(t):
    r'\$[A-Za-z_][\w_]*'
    return t

def t_NUM(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print t.lexer.current_state
    print dir(t.lexer)
    raise TypeError("unknown char '%s'"%(t.value))

lexer = lex.lex()

lex.input(string)
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)


##now for the parsing

"""
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";    
"""

def p_assign(p):
    '''assign : VARIABLE EQUALS input'''
    print "assign rule"
    print p[1],p[2],p[3]
    p[0] = p[1]

def p_input(p):
    '''input : VARIABLE LBRACKET CONSTANT_ENCAPSED_STRING RBRACKET SEMICOLON
             | VARIABLE LBRACKET QUOTED_ENCAPSED_STRING RBRACKET SEMICOLON'''
    print "input rule"
    value =  p[1]+p[2]+p[3]+p[4]+p[5]
    p[0] = value

def p_wrong(p):
    '''wrong : VARIABLE EQUALS QUOTED_ENCAPSED_STRING DOT VARIABLE DOT QUOTED_ENCAPSED_STRING SEMICOLON'''
    print "wrong"    


yacc.yacc()
yacc.parse(string)

And the results:

...
WARNING: There is 1 unused rule
WARNING: Symbol 'wrong' is unreachable
Generating LALR tables
yacc: Syntax error at line 6, token=OPEN_TAG
input rule
assign rule
$name = $_REQUEST['name'];
yacc: Syntax error at line 8, token=VARIABLE

My (incorrect) attempt at parsing line 3 (with the format hard-coded in the parser rule p_wrong) doesn't even get hit. But I would just like some guidance on how to proceed to parse this simple code block.
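A likely explanation for the "unreachable" warning: PLY treats the first grammar rule (here assign) as the start symbol unless a start symbol is given explicitly, and nothing derivable from assign ever mentions wrong. The sketch below does not use PLY at all. It copies the three rules from the code above into a plain dictionary (a hypothetical representation chosen for illustration) and computes which nonterminals can be reached from the start symbol.

```python
# Grammar from the question: nonterminal -> list of alternatives (lists of symbols).
grammar = {
    "assign": [["VARIABLE", "EQUALS", "input"]],
    "input": [
        ["VARIABLE", "LBRACKET", "CONSTANT_ENCAPSED_STRING", "RBRACKET", "SEMICOLON"],
        ["VARIABLE", "LBRACKET", "QUOTED_ENCAPSED_STRING", "RBRACKET", "SEMICOLON"],
    ],
    "wrong": [["VARIABLE", "EQUALS", "QUOTED_ENCAPSED_STRING", "DOT",
               "VARIABLE", "DOT", "QUOTED_ENCAPSED_STRING", "SEMICOLON"]],
}

def reachable(grammar, start):
    """Return the set of nonterminals that can be derived from `start`."""
    seen = set()
    stack = [start]
    while stack:
        symbol = stack.pop()
        if symbol in seen or symbol not in grammar:
            continue  # already visited, or a terminal (token)
        seen.add(symbol)
        for alternative in grammar[symbol]:
            stack.extend(alternative)
    return seen

start = "assign"  # the first rule acts as the start symbol by default
unreachable = sorted(set(grammar) - reachable(grammar, start))
print(unreachable)

# Adding a hypothetical top-level rule that mentions both statement kinds
# makes every rule reachable again.
grammar["program"] = [["assign"], ["wrong"]]
print(sorted(set(grammar) - reachable(grammar, "program")))
```

The same reasoning suggests why the parser stops after one statement: the start symbol describes a single assignment, so a rule describing a sequence of statements would have to sit above it.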

Desired output

Ideally I will have results that allow me to trace the user input something like this:

user-input -> $name -> $msg -> htmlspecialchars($msg)
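To make the desired trace concrete, here is a rough sketch of how such a chain could be built once a parser yields, for each statement, the assigned variable, the variables read, and the function called (if any). The statements list, the SOURCES and SANITIZERS sets, and the trace_taint helper are all hypothetical names for illustration, and the propagation is deliberately naive (straight-line code only, first tainted operand wins).

```python
SOURCES = {"$_REQUEST"}            # assumed user-controlled inputs
SANITIZERS = {"htmlspecialchars"}  # assumed sanitizing functions

# Hand-written stand-in for parser output:
# (assigned variable, variables read, function applied or None)
statements = [
    ("$name", ["$_REQUEST"], None),
    ("$msg", ["$name"], None),
    ("$encoded", ["$msg"], "htmlspecialchars"),
]

def trace_taint(statements):
    """Return one chain of steps for each tainted value that reaches a sanitizer."""
    origin = {}  # tainted variable -> chain of steps that led to it
    traces = []
    for target, reads, func in statements:
        chain = None
        via = None
        for var in reads:
            if var in SOURCES:
                chain, via = ["user-input"], var
                break
            if var in origin:
                chain, via = origin[var], var
                break
        if chain is None:
            origin.pop(target, None)  # overwritten with untainted data
        elif func in SANITIZERS:
            traces.append(chain + ["%s(%s)" % (func, via)])
            origin.pop(target, None)  # result is treated as sanitized
        else:
            origin[target] = chain + [target]
    return traces

for trace in trace_taint(statements):
    print(" -> ".join(trace))
```

Keeping the chain alongside each tainted variable means the trace can be printed directly when a sanitizer is reached, without walking the assignments backwards.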