How do I write a PEG for LPeg to parse Lua itself?

Posted on 2025-01-17 21:17:58

As the title says: I know Lua has an official extended BNF in "The Complete Syntax of Lua". I want to write a PEG that I can pass to LPeg's re.compile to parse Lua itself. Perhaps the PEG for Lua looks much like its BNF. I have read the BNF and tried to translate it into a PEG, but I find Numeral and LiteralString hard to write. Has anyone done something like this?

local lpeg = require "lpeg"
local re = require "re" -- re ships with LPeg as its own module; there is no lpeg.re field

local p = re.compile([[
    chunk <- block
    block <- stat * retstat ?
    stat <- ';' /
            varlist '=' explist /
            functioncall /
            label /
            'break' /
            'goto' Name /
            'do' block 'end' /
            'while' exp 'do' block 'end' /
            'repeat' block 'until' exp /
            'if' exp 'then' block ('elseif' exp 'then' block) * ('else' block) ? 'end' /
            'for' Name '=' exp ',' exp (',' exp) ? 'do' block 'end' /
            'for' namelist 'in' explist 'do' block 'end' /
            'function' funcname funcbody /
            'local function' Name funcbody /
            'local' attnamelist ('=' explist) ?
    attnamelist <- Name attrib (',' Name attrib) *
    attrib <- ('<' Name '>') ?
    retstat <- 'return' explist ? ';' ?
    label <- '::' Name '::'
    funcname <- Name ('.' Name) * (':' Name) ?
    varlist <- var (',' var) *
    var <- Name / prefixexp '[' exp ']' / prefixexp '.' Name
    namelist <- Name (',' Name) *
    explist <- exp (',' exp) *
    exp <- 'nil' / 'false' / 'true' / Numeral / LiteralString / "..." / functiondef /
           prefixexp / tableconstructor / exp binop exp / unop exp
    prefixexp <- var / functioncall / '(' exp ')'
    functioncall <- prefixexp args / prefixexp ":" Name args
    args <- '(' explist ? ')' / tableconstructor / LiteralString
    functiondef <- 'function' funcbody
    funcbody <- '(' parlist ? ')' block 'end'
    parlist <- namelist (',' '...') ? / '...'
    tableconstructor <- '{' fieldlist ? '}'
    fieldlist <- field (fieldsep field) * fieldsep ?
    field <- '[' exp ']' '=' exp / Name '=' exp / exp
    fieldsep <- ',' / ';'
    binop <- '+' / '-' / '*' / '//' / '/' / '^' / '%' /
             '&' / '~=' / '~' / '|' / '>>' / '<<' / '..' /
             '<=' / '<' / '>=' / '>' / '==' /
             'and' / 'or'
    unop <- '-' / 'not' / '#' / '~'

    saveword <- "and" / "break" / "do" / "else" / "elseif" / "end" /
                "false" / "for" / "function" / "goto" / "if" / "in" /
                "local" / "nil" / "not" / "or" / "repeat" / "return" /
                "then" / "true" / "until" / "while"
    Name <- ! saveword / name
    Numeral <- 
    LiteralString <- 
]])

Answer by 带上头具痛哭, posted 2025-01-24 21:17:58

First off: you should parse Lua in a two-step process consisting of tokenization (lexical analysis, RegEx) and parsing (syntactic analysis, CFGs).
Consider the syntactically invalid Lua code if1then print()end. If you parse this in one go, you might not get a syntax error, since it could reasonably be read as if, the number 1, then, and so on. A tokenizer, however, greedily reads if1then as a single "identifier"/Name token, which triggers a syntax error in the syntactic analysis later on.

PEGs may let you express this in some cases through their ordered choice, but in general you should use the two-step process so that you do not end up with an overly permissive (and possibly ambiguous) grammar.

The "rules" still left to write are all token rules (as the capitalized names show): Name, LiteralString and Numeral. These are basically just simple RegExes. As for Names: if you use the ordered choice + of PEGs cleverly, you don't need "subtraction" (negative lookahead) to keep keywords from being parsed as Names. Just do something along the lines of Token = Keyword + Name + ... in your tokenization grammar.
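
A minimal sketch of such a tokenization grammar, assuming LPeg is installed. The helper names token and kinds are made up for this illustration, and only three keywords are listed. One caveat the paragraph above does not spell out: with a bare Keyword + Name choice, a name such as iffy would lex as the keyword if followed by fy, so the sketch guards the keyword alternative with a check that no name character follows.

```lua
local lpeg = require "lpeg"
local P, R, S, C, Cc, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cc, lpeg.Ct

local namechar = R("AZ", "az", "09") + P"_"
-- Only a few keywords, for brevity (assumption: extend the list as needed).
-- The trailing -namechar rejects a keyword that runs on into a longer name.
local keyword = (P"if" + P"then" + P"end") * -namechar
local name = (R("AZ", "az") + P"_") * namechar ^ 0
local number = R"09" ^ 1
local symbol = S"()"
local space = S" \t\r\n" ^ 1

-- Hypothetical helper: wrap a pattern so it yields a {kind, text} pair.
local function token(kind, patt)
    return Ct(Cc(kind) * C(patt))
end

-- Ordered choice: the keyword alternative is tried before the name one.
local tokens = Ct((space
    + token("keyword", keyword)
    + token("name", name)
    + token("number", number)
    + token("symbol", symbol)) ^ 0) * -P(1)

-- Hypothetical helper: render the token list as "kind:text kind:text ...".
local function kinds(source)
    local list = assert(tokens:match(source), "lexical error")
    local out = {}
    for i, tok in ipairs(list) do
        out[i] = tok[1] .. ":" .. tok[2]
    end
    return table.concat(out, " ")
end

print(kinds("if 1 then print() end"))
print(kinds("if1then print()end"))
```

In the second call, if1then comes out as a single name token, so a parser working on the token stream would see two names in a row and could report a syntax error.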

Literal strings are indeed tricky because of long strings, which can't be written as RegExes; quoted strings are rather easy (though you have to deal with escapes). The LPeg docs have an example for long strings:

equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
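
Since the question asks for a grammar in re syntax, here is a sketch of how LiteralString might look there. It is an illustration rather than a complete lexer rule: it only recognizes a literal, does not decode escapes, and leaves out the \z escape. The long-string part follows the named-group and back-reference idiom ({:eq: ... :} and =eq) from the re documentation; the rule names quoted, escape, longstring, close, hex and dec are chosen here. The trailing !. is only there so the pattern can be tested on a whole subject and would be dropped inside a larger grammar.

```lua
local re = require "re"

-- Recognizer only: succeeds when the whole subject is one string literal.
local literal_string = re.compile([=[
    LiteralString <- (quoted / longstring) !.
    quoted     <- '"' (escape / !["\] !%nl .)* '"'
                / "'" (escape / !['\] !%nl .)* "'"
    escape     <- '\' ( [abfnrtv\'"] / %nl / 'x' hex hex
                      / 'u{' hex+ '}' / dec dec? dec? )
    longstring <- '[' {:eq: '='* :} '[' (!close .)* close
    close      <- ']' =eq ']'
    hex        <- [0-9a-fA-F]
    dec        <- [0-9]
]=])

-- Hypothetical helper for trying the pattern out.
local function is_literal_string(s)
    return literal_string:match(s) ~= nil
end

print(is_literal_string('"hello"'))
print(is_literal_string("[==[a]]b]==]"))
print(is_literal_string('"unterminated'))
```

The closing bracket only matches when it carries the same number of = signs as the opening one, which is the part a plain regular expression cannot express.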

Numerals are a bit clunky because you have to handle many different cases: different bases, an omitted dot, omitted digits before or after the dot, exponents, signs and so on.
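
For the re syntax used in the question, those cases could be sketched as follows. This is a hedged illustration, not a verified lexer rule: the rule names hexnum, decnum, hex and dec are chosen here, and the trailing !. is only there so the pattern can be tested on a whole subject. Each rule tries the form with leading digits first and the bare ".digits" form second, and the exponent is written in decimal digits for both bases.

```lua
local re = require "re"

-- Recognizer only: succeeds when the whole subject is one numeral.
local numeral = re.compile([[
    Numeral <- (hexnum / decnum) !.
    hexnum  <- '0' [xX] (hex+ ('.' hex*)? / '.' hex+) ([pP] [+-]? dec+)?
    decnum  <- (dec+ ('.' dec*)? / '.' dec+) ([eE] [+-]? dec+)?
    hex     <- [0-9a-fA-F]
    dec     <- [0-9]
]])

-- Hypothetical helper for trying the pattern out.
local function is_numeral(s)
    return numeral:match(s) ~= nil
end

-- A few of the sample constants listed in the Lua reference manual.
for _, s in ipairs { "3", "314.16e-2", "0xff", "0x0.1E", "0xA23p-4" } do
    print(s, is_numeral(s))
end
print("0x", is_numeral("0x"))
```

The hexadecimal alternative has to come first, since otherwise the decimal rule would accept the leading 0 of 0xff and stop there.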

I happen to have the relevant LPeg rules for these lying around:

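-- LPeg constructors used by the rules below
local lpeg = require "lpeg"
local P, R, S = lpeg.P, lpeg.R, lpeg.S
local C, Cg, Cb, Cmt = lpeg.C, lpeg.Cg, lpeg.Cb, lpeg.Cmt
-- O and _eof were not defined in the snippet; these definitions are assumptions
local function O(p) return P(p) ^ -1 end -- optional: zero or one occurrence of p
_eof = -P(1) -- succeeds only at the end of the input
-- Character classes
_letter = R("AZ", "az")
_letter_ = _letter + P"_"
_digit = R"09"
_hexdigit = _digit + R("AF", "af")
white = C(S" \f\t\v\n\r" ^ 1)
-- Keywords; a keyword must not run on into further name characters,
-- otherwise names such as "format" or "done" would be rejected below
_keyword = (P"not"
    + P"and"
    + P"or"
    + P"function"
    + P"nil"
    + P"false"
    + P"true"
    + P"return"
    + P"goto"
    + P"do"
    + P"end"
    + P"while"
    + P"repeat"
    + P"until"
    + P"if"
    + P"then"
    + P"elseif"
    + P"else"
    + P"for"
    + P"local"
    + P"break"
    + P"in") * -(_letter_ + _digit)
-- Names
Name = C(_letter_ * (_letter_ + _digit) ^ 0) - _keyword
-- Numbers
local function _numeral(digit_, exponent_letter)
    local uint = digit_ ^ 1
    -- Longest alternatives first: ordered choice commits to the first match
    local float = uint * P"." * uint + uint * P"." + P"." * uint + uint
    -- The exponent is written in decimal digits, even for hex numerals
    local exponent = exponent_letter * O(S"+-") * _digit ^ 1
    return C(float) * C(O(exponent))
end
_hex_numeral = C(P"0" * S"xX") * _numeral(_hexdigit, S"pP")
_decimal_numeral = _numeral(_digit, S"eE")
Numeral = _hex_numeral + _decimal_numeral
-- Strings
decimal_escape = C(_digit * O(_digit * O(_digit)))
hex_escape = P"x" * C(_hexdigit * _hexdigit)
unicode_escape = P"u{" * C(_hexdigit^1) * P"}"
char_escape = C(S[[abfnrtv\'"]])
_escape = P[[\]] * (decimal_escape + hex_escape + char_escape + unicode_escape)
local function string_quoted(quotes)
    -- Any character except the quote itself, NUL, line breaks and backslash
    local range = P(1) - S(quotes .. "\0\n\r\\")
    local content = (_escape + C(range^1)) ^ 0
    return Cg(P(quotes), "quotes") * content * P(quotes)
end
local equals = P"=" ^ 0
local open = P"[" * Cg(equals, "equals") * P"[" * O(P"\n")
-- The closing bracket must carry as many "=" as the opening one
local close = Cmt(P"]" * C(equals) * P"]" * Cb"equals", function(_, _, closing, opening)
    return closing == opening
end)
_long_string = open * C((P(1) - close) ^ 0) * close
String = string_quoted[[']] + string_quoted[["]] + _long_string
_line_comment = -open * ((P(1) - P"\n") ^ 0) * (P"\n" + _eof)
Comment = P"--" * (_long_string + _line_comment)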

You might want to localize some (if not all) of these variables unless you load the rules in a custom environment.
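
One way to load rules into a custom environment, sketched under assumptions: the source string below is a hypothetical stand-in for a file of rule definitions, and the setfenv branch targets Lua 5.1 and LuaJIT while the load branch targets Lua 5.2 and later. Assignments that look like globals inside the chunk end up in env instead of _G.

```lua
-- Hypothetical stand-in for a file of rule definitions that assign globals.
local source = [[
    _digit = "0-9"
    Numeral = "built from " .. _digit
]]

-- Fresh environment: reads fall back to the real globals, writes stay in env.
local env = setmetatable({}, { __index = _G })

local chunk
if setfenv then
    -- Lua 5.1 / LuaJIT: compile first, then swap the function environment
    chunk = assert(loadstring(source, "rules"))
    setfenv(chunk, env)
else
    -- Lua 5.2 and later: pass the environment to load directly
    chunk = assert(load(source, "rules", "t", env))
end
chunk()

print(env.Numeral)             -- defined inside the custom environment
print(rawget(_G, "Numeral"))   -- the real global table is left untouched
```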
