I am writing a PHP parser in PLY in order to teach myself the concepts of lexing/parsing.
I have the lexer tokens created for a very simple PHP code snippet but I am stuck on the proper way to parse.
Here is the code snippet I am trying to lex/parse:
<?php if (isset($_REQUEST['name'])){
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";
$encoded = htmlspecialchars($msg);
}
?>
My goal is to trace the user input and determine that it has indeed reached the htmlspecialchars() function. My current parsing strategy gets me as far as parsing line 2:
$name = $_REQUEST['name'];
but I have no idea what the proper way is to parse line 3:
$msg = "Hello, " . $name . "!";
The complication is that I will never know in advance how many concatenations will be applied to my user input, and it feels wrong to "hard code" a rule just to successfully parse the example code. For this line, what I care about is the fact that the $msg variable includes my user-supplied data (from the $name variable).
I have tried parsing this statement in probably the worst possible way, just to test whether I could reach it, but when I run my script it says WARNING: Symbol 'wrong' is unreachable
def p_wrong(p):
    '''wrong : VARIABLE EQUALS QUOTED_ENCAPSED_STRING DOT VARIABLE DOT QUOTED_ENCAPSED_STRING SEMICOLON'''
    print "wrong"
So I am hoping for guidance on how to parse line 3 in such a way that it won't matter how many concatenations or other operations are applied to the variables I am tracing. I have a feeling this is where a lesson on BNF grammar, or the wonderfully painful complexities of parsing, will begin. I want to learn; I just don't know where to start.
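To make the goal concrete, here is a rough sketch of the kind of rule I think I am after, written in plain Python 3 rather than PLY so it runs on its own. It assumes a repetition-style grammar in which an expression is one term followed by any number of "DOT term" pairs. The names tokenize, parse_assignment and the single STRING token are hypothetical simplifications of my lexer, not part of PLY.

```python
import re

# Hypothetical, simplified token set loosely mirroring the lexer in this post.
TOKEN_RE = re.compile(r'''
    (?P<VARIABLE>\$[A-Za-z_]\w*)
  | (?P<STRING>"(?:[^\\"]|\\.)*"|'(?:[^\\']|\\.)*')
  | (?P<DOT>\.)
  | (?P<EQUALS>=)
  | (?P<LBRACKET>\[)
  | (?P<RBRACKET>\])
  | (?P<SEMICOLON>;)
  | (?P<WS>\s+)
''', re.VERBOSE)


def tokenize(text):
    """Split text into (token_type, value) pairs, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unknown char %r" % text[pos])
        if m.lastgroup != 'WS':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# Grammar assumed by this sketch:
#   statement : VARIABLE EQUALS expr SEMICOLON
#   expr      : term (DOT term)*   (same language as: expr : expr DOT term | term)
#   term      : STRING | VARIABLE | VARIABLE LBRACKET STRING RBRACKET
def parse_assignment(tokens):
    """Return (target, variables_read) for one assignment statement."""
    i = 0

    def peek():
        return tokens[i][0] if i < len(tokens) else None

    def take(kind):
        nonlocal i
        if peek() != kind:
            raise SyntaxError("expected %s, found %s" % (kind, peek() or 'EOF'))
        value = tokens[i][1]
        i += 1
        return value

    def term(sources):
        if peek() == 'STRING':
            take('STRING')          # string literals carry no variable data
            return
        name = take('VARIABLE')
        if peek() == 'LBRACKET':    # array access such as $_REQUEST['name']
            take('LBRACKET')
            take('STRING')
            take('RBRACKET')
        sources.append(name)

    target = take('VARIABLE')
    take('EQUALS')
    sources = []
    term(sources)
    while peek() == 'DOT':          # any number of concatenations
        take('DOT')
        term(sources)
    take('SEMICOLON')
    return target, sources


print(parse_assignment(tokenize('$msg = "Hello, " . $name . "!";')))
```

The point of the sketch is that the loop over DOT replaces my hard-coded sequence of tokens, so the same rule accepts one concatenation or twenty. I assume the PLY equivalent would be a recursive rule of the form expr : expr DOT term | term.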
Here is my complete code at this point:
import ply.lex as lex
import ply.yacc as yacc
string = """<?php if (isset($_REQUEST['name'])){
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";
$encoded = htmlspecialchars($msg);
}
?>"""
delimeters = ('LPAREN', 'RPAREN', 'LBRACKET', 'RBRACKET')
tokens = delimeters + (
"CHAR",
"NUM",
"OPEN_TAG",
"CLOSE_TAG",
"VARIABLE",
"CONSTANT_ENCAPSED_STRING",
"ENCAPSED_AND_WHITESPACE",
"QUOTED_ENCAPSED_STRING",
"LCURLYBRACKET",
"RCURLYBRACKET",
"EQUALS",
"SEMICOLON",
"QUOTE",
"DOT",
"IF"
)
t_ignore = " \t"
t_CHAR = r"[a-z]"
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_RBRACKET = r'\]'
t_LBRACKET = r'\['
t_RCURLYBRACKET = r'\}'
t_LCURLYBRACKET = r'\{'
t_EQUALS = r'='
t_SEMICOLON = r';'
t_DOT = r'\.'
def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")
def t_CONSTANT_ENCAPSED_STRING(t):
    r"'([^\\']|\\(.|\n))*'"
    t.lexer.lineno += t.value.count("\n")
    return t
def t_QUOTED_ENCAPSED_STRING(t):
    r"""\"([^\\"]|\\(.|\n))*\""""
    t.lexer.lineno += t.value.count("\n")
    return t
def t_OPEN_TAG(t):
    r'<[?%]((php[ \t\r\n]?)|=)?'
    if '=' in t.value: t.type = 'OPEN_TAG_WITH_ECHO'
    t.lexer.lineno += t.value.count("\n")
    return t
def t_CLOSE_TAG(t):
    r'[?%]>\r?\n?'
    t.lexer.lineno += t.value.count("\n")
    #t.lexer.begin('INITIAL')
    return t
def t_VARIABLE(t):
    r'\$[A-Za-z_][\w_]*'
    return t
def t_NUM(t):
    r"\d+"
    t.value = int(t.value)
    return t
def t_error(t):
    print t.lexer.current_state
    print dir(t.lexer)
    raise TypeError("unknown char '%s'"%(t.value))
lexer = lex.lex()
lex.input(string)
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
##now for the parsing
"""
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";
"""
def p_assign(p):
    '''assign : VARIABLE EQUALS input'''
    print "assign rule"
    print p[1],p[2],p[3]
    p[0] = p[1]
def p_input(p):
    '''input : VARIABLE LBRACKET CONSTANT_ENCAPSED_STRING RBRACKET SEMICOLON
             | VARIABLE LBRACKET QUOTED_ENCAPSED_STRING RBRACKET SEMICOLON'''
    print "input rule"
    value = p[1]+p[2]+p[3]+p[4]+p[5]
    p[0] = value
def p_wrong(p):
    '''wrong : VARIABLE EQUALS QUOTED_ENCAPSED_STRING DOT VARIABLE DOT QUOTED_ENCAPSED_STRING SEMICOLON'''
    print "wrong"
yacc.yacc()
yacc.parse(string)
And the results:
...
WARNING: There is 1 unused rule
WARNING: Symbol 'wrong' is unreachable
Generating LALR tables
yacc: Syntax error at line 6, token=OPEN_TAG
input rule
assign rule
$name = $_REQUEST['name'];
yacc: Syntax error at line 8, token=VARIABLE
My (incorrect) attempt at parsing line 3 (with the format hard-coded in the parser rule p_wrong) doesn't even get hit. But I would just like some guidance on how to proceed to parse this simple code block.
Desired output
Ideally I will end up with results that let me trace the user input, something like this:
user-input -> $name -> $msg -> htmlspecialchars($msg)
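For illustration, this is roughly how I picture producing that trace once the parser can hand me each statement as a (target, function, variables read) triple. It is only a sketch in plain Python 3 under that assumption; trace_user_input, EXAMPLE and the triple format are hypothetical names of my own, not something PLY provides.

```python
def trace_user_input(statements, source="$_REQUEST", sink="htmlspecialchars"):
    """Follow user input through assignments and report each time it reaches `sink`.

    statements: list of (target, function_or_None, [variables read]) in program order.
    Returns a list of human-readable trace strings.
    """
    # Maps each tainted variable to the chain of names that led to it.
    chains = {source: ["user-input"]}
    reached = []
    for target, func, reads in statements:
        tainted = [v for v in reads if v in chains]
        if not tainted:
            # Target is overwritten with data that does not come from the user.
            chains.pop(target, None)
            continue
        chain = chains[tainted[0]]
        if func == sink:
            reached.append(" -> ".join(chain + ["%s(%s)" % (func, tainted[0])]))
            # The result of the sink is treated as clean in this sketch.
            chains.pop(target, None)
        else:
            chains[target] = chain + [target]
    return reached


# The three assignments from the PHP snippet, written out by hand.
EXAMPLE = [
    ("$name", None, ["$_REQUEST"]),
    ("$msg", None, ["$name"]),
    ("$encoded", "htmlspecialchars", ["$msg"]),
]

for line in trace_user_input(EXAMPLE):
    print(line)  # user-input -> $name -> $msg -> htmlspecialchars($msg)
```

Only the first tainted variable in each statement is followed here, which is enough for my example but would need extending for statements that mix several user-controlled values.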