How do I write a PEG for LPeg that parses Lua itself?
As the title says, I know Lua has an official extended BNF in The Complete Syntax of Lua. I want to write a PEG that I can pass to re.compile (LPeg's re module) to parse Lua itself. Presumably the Lua PEG looks something like its BNF. I have read the BNF and tried to translate it into a PEG, but I found Numeral and LiteralString hard to write. Has anyone done something like this?
local lpeg = require "lpeg"
local re = require "re" -- re is a separate module that ships with LPeg
local p = re.compile([[
chunk <- block
block <- stat * retstat ?
stat <- ';' /
varlist '=' explist /
functioncall /
label /
'break' /
'goto' Name /
'do' block 'end' /
'while' exp 'do' block 'end' /
'repeat' block 'until' exp /
'if' exp 'then' block ('elseif' exp 'then' block) * ('else' block) ? 'end' /
'for' Name '=' exp ',' exp (',' exp) ? 'do' block 'end' /
'for' namelist 'in' explist 'do' block 'end' /
'function' funcname funcbody /
'local function' Name funcbody /
'local' attnamelist ('=' explist) ?
attnamelist <- Name attrib (',' Name attrib) *
attrib <- ('<' Name '>') ?
retstat <- 'return' explist ? ';' ?
label <- '::' Name '::'
funcname <- Name ('.' Name) * (':' Name) ?
varlist <- var (',' var) *
var <- Name / prefixexp '[' exp ']' / prefixexp '.' Name
namelist <- Name (',' Name) *
explist <- exp (',' exp) *
exp <- 'nil' / 'false' / 'true' / Numeral / LiteralString / "..." / functiondef /
prefixexp / tableconstructor / exp binop exp / unop exp
prefixexp <- var / functioncall / '(' exp ')'
functioncall <- prefixexp args / prefixexp ":" Name args
args <- '(' explist ? ')' / tableconstructor / LiteralString
functiondef <- 'function' funcbody
funcbody <- '(' parlist ? ')' block 'end'
parlist <- namelist (',' '...') ? / '...'
tableconstructor <- '{' fieldlist ? '}'
fieldlist <- field (fieldsep field) * fieldsep ?
field <- '[' exp ']' '=' exp / Name '=' exp / exp
fieldsep <- ',' / ';'
binop <- '+' / '-' / '*' / '/' / '//' / '^' / '%' /
'&' / '~' / '|' / '>>' / '<<' / '..' /
'<' / '<=' / '>' / '>=' / '==' / '~=' /
'and' / 'or'
unop <- '-' / 'not' / '#' / '~'
saveword <- "and" / "break" / "do" / "else" / "elseif" / "end" /
"false" / "for" / "function" / "goto" / "if" / "in" /
"local" / "nil" / "not" / "or" / "repeat" / "return" /
"then" / "true" / "until" / "while"
Name <- !saveword name
Numeral <-
LiteralString <-
]])
Comments (1)
First off: you need to parse Lua in a two-step process consisting of tokenization (lexical analysis, RegEx) and parsing (syntactical analysis, CFGs).

Consider the syntactically invalid Lua code `if1then print()end`. If you just parse this in one go, you might not get a syntax error, as theoretically it could reasonably be interpreted as `if`, number `1`, `then` and so on. Tokenization, however, would greedily make `if1then` a single "identifier"/name token, triggering a syntax error in the syntactical analysis later on. PEGs might allow you to express this in some cases through their ordered choice, but generally the two-step process should be applied so that you do not end up with an overly permissive (and possibly ambiguous) grammar.
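The tokenization step can be sketched with plain LPeg as follows. This is only an illustration under assumptions: the keyword list is cut down to three entries, and the names `tok`, `tokens` and `kinds` are made up for this sketch. Each keyword is only accepted when no identifier character follows it, which is what lets the ordered choice put keywords before names safely.

```lua
local lpeg = require "lpeg"
local P, R, S, C, Cc, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cc, lpeg.Ct

local idstart = R("az", "AZ") + P"_"
local idchar  = idstart + R"09"

-- Only three keywords, to keep the sketch short (assumption, not the full list).
-- The trailing -idchar rejects a keyword that is merely a prefix of a name.
local keyword = (P"if" + P"then" + P"end") * -idchar
local name    = idstart * idchar^0
local number  = R"09"^1
local symbol  = S"()"
local space   = S" \t\r\n"^1

-- Wrap a pattern so that it produces a { kind, text } pair.
local function tok(kind, patt)
  return Ct(Cc(kind) * C(patt))
end

-- Ordered choice: keywords are tried before names.
local token  = tok("keyword", keyword) + tok("name", name)
             + tok("number", number) + tok("symbol", symbol)
local tokens = Ct((space + token)^0) * -1

-- Helper for inspection: render the token list as "kind:text" pairs.
local function kinds(src)
  local out = {}
  for i, t in ipairs(assert(tokens:match(src), "lexical error")) do
    out[i] = t[1] .. ":" .. t[2]
  end
  return table.concat(out, " ")
end

print(kinds("if 1 then print() end"))
print(kinds("if1then print()end"))
```

In the second call `if1then` comes out as a single name token, so a parser working on the token list sees a name followed by a call and a stray `end`, and can report the syntax error.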
The "rules" still left to be written are all token rules (as can be seen from the capitalized names): `Name`, `LiteralString` and `Numeral`. These are basically just simple RegExes. As for `Name`s: if you use the ordered choice of PEGs cleverly (`+` in LPeg, `/` in re syntax), you don't have to use "subtraction" (negative lookahead) to avoid keywords being parsed as `Name`s. Just do something along the lines of `Token = Keyword + Name + ...` in your tokenization grammar.

Literal strings are indeed tricky because of long strings, which can't be written as RegExes; quoted strings are rather easy (you have to deal with escapes, though). The LPeg docs have an example concerning long strings.
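A sketch of both kinds of literal string, for illustration. The long-string part follows the approach of the example in the LPeg manual (capture the opening run of `=` and compare it with each candidate closing bracket); the quoted-string part is a simplification that treats a backslash plus any one character as an escape, so it does not validate escapes such as `\z` or `\u{...}`. The variable names are assumptions of this sketch, not the answer's original rules.

```lua
local lpeg = require "lpeg"
local P, S, C, Cg, Cb, Cmt = lpeg.P, lpeg.S, lpeg.C, lpeg.Cg, lpeg.Cb, lpeg.Cmt

-- Quoted strings: an escape is a backslash followed by any character
-- (simplified); unescaped newlines and the closing quote end the body.
local escape = P"\\" * P(1)
local function quoted(q)
  return P(q) * (escape + (P(1) - S(q .. "\\\n")))^0 * P(q)
end
local shortstring = quoted('"') + quoted("'")

-- Long strings: remember the '=' run of the opening bracket in the
-- named group "init" and only accept a closing bracket with the same run.
local equals  = P"="^0
local open    = "[" * Cg(equals, "init") * "[" * P"\n"^-1
local close   = "]" * C(equals) * "]"
local closeeq = Cmt(close * Cb("init"), function(_, _, a, b)
  return a == b
end)
-- "/ 1" keeps only the first capture, i.e. the string body.
local longstring = open * C((P(1) - closeeq)^0) * close / 1

print(longstring:match("[==[\nfoo]]bar]==]")) --> foo]]bar
print(shortstring:match([["a\"b" tail]]))     --> 7
```

`shortstring` returns the position after the closing quote because it has no captures; wrapping it in `C(...)` would return the matched text instead.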
Numerals are a bit clunky because you have to deal with many different cases: different bases, omission of the dot, omission of the digits before or after the dot, exponents, exponent signs, etc.
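A minimal sketch of a numeral rule covering those cases, assuming Lua 5.3/5.4 lexical rules (decimal and hexadecimal, optional fraction, `e`/`p` exponents). The helper `mantissa` and the other names are invented for this sketch. A leading minus sign is deliberately left out, since Lua treats it as a unary operator rather than as part of the numeral.

```lua
local lpeg = require "lpeg"
local P, R, S = lpeg.P, lpeg.R, lpeg.S

local digit  = R"09"
local xdigit = R("09", "af", "AF")
local idchar = R("az", "AZ", "09") + P"_"

-- "1", "1.", "1.5" or ".5": at least one digit on one side of the dot.
local function mantissa(d)
  return d^1 * (P"." * d^0)^-1 + P"." * d^1
end

local dec_exp = S"eE" * S"+-"^-1 * digit^1 -- decimal exponent, e.g. e+23
local hex_exp = S"pP" * S"+-"^-1 * digit^1 -- binary exponent, e.g. p-3

local hex = P"0" * S"xX" * mantissa(xdigit) * hex_exp^-1
local dec = mantissa(digit) * dec_exp^-1

-- Hexadecimal must be tried first, otherwise "0" of "0xFF" matches as decimal.
-- The final -idchar rejects malformed input such as "3x" or "1e".
local numeral = (hex + dec) * -idchar

print(numeral:match("6.02e+23")) --> 9
print(numeral:match("0x1.8p-3")) --> 9
print(numeral:match("3x"))       --> nil
```

The pattern returns the position after the numeral; converting the matched text to a number can then be left to `tonumber`.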
I happen to have the relevant LPeg rules for these lying around:
You might want to localize some (if not all) of these variables unless you load the rules in a custom environment.
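Loading rules in a custom environment could look roughly like this. The sketch assumes Lua 5.2 or later (where `load` accepts an environment argument), and `rules_source` is a made-up stand-in for a file of rules written as global assignments; it is not the answer's rule set.

```lua
local lpeg = require "lpeg"

-- Hypothetical rule file: global assignments that use LPeg names unqualified.
local rules_source = [[
  digit   = R"09"
  integer = digit^1
]]

-- The environment falls back to the lpeg table for P, R, S and friends,
-- and collects every "global" the rules define.
local env = setmetatable({}, { __index = lpeg })

local chunk = assert(load(rules_source, "rules", "t", env))
chunk()

print(env.integer:match("2024")) --> 5
print(rawget(_G, "integer"))     --> nil
```

The rules end up as fields of `env`, so nothing leaks into the real global table; the alternative is to put `local` in front of each rule.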