EBNF / parboiled: how to convert a regex into PEG?
This question is specific both to the parboiled parser framework and to BNF/PEG in general.
Let's say I have the fairly simple regular expression
^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$
which corresponds to this pseudo-EBNF:
<line> ::= <ws>? <identifier> <ws>? '=' <ws>? <nonwhitespace> <ws>?
<ws> ::= (' ' | '\t' | {other whitespace characters})+
<identifier> ::= <identifier-head> <identifier-tail>
<identifier-head> ::= <letter> | '_'
<identifier-tail> ::= (<letter> | <digit> | '_')*
<letter> ::= ('A'..'Z') | ('a'..'z')
<digit> ::= '0'..'9'
<nonwhitespace> ::= ___________
How would you define nonwhitespace (one or more characters that aren't whitespace) in EBNF?
For those of you familiar with the Java parboiled library, how could you implement a rule that defines nonwhitespace?
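For reference, here is a minimal sketch of how the regular expression above behaves in plain Java, which is the behavior a PEG translation would need to reproduce. The class and method names (LineRegexDemo, parse) are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class LineRegexDemo {
    // The doubled backslashes are Java string escapes for \s and \S.
    static final Pattern LINE =
        Pattern.compile("^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$");

    // Returns {identifier, value}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Group 1 is the identifier and group 2 is the nonwhitespace run, so a value containing an embedded space is rejected.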
Comments (2)
You are stuck with the conventions of your lexer generator for specifying character ranges and operations on character ranges.
Many lexer generators accept hex values (something like 0x) to represent characters, so you might write something like 0x30..0x39 for digits; the exact range syntax depends on the tool.
For nonwhitespace, you need to know which character set you are using. For 7-bit ASCII, nonwhitespace is conceptually all the printing characters, that is 0x21..0x7E.
For ISO8859-1, you add codes from the upper half of the code page, for example 0xA1..0xFF.
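As a rough illustration of these "additive" ranges, the following Java sketch spells them out as predicates. The class name AdditiveRanges is hypothetical, and the ISO8859-1 range reflects one possible choice (treating 0xA0 and the 0x80..0x9F control codes as not nonwhitespace), which is an assumption rather than anything the answer prescribes.

```java
// Hypothetical helper; the ranges use the same hex codes discussed above.
class AdditiveRanges {
    // ASCII digits '0'..'9' occupy the code range 0x30..0x39.
    static boolean isDigit(char c) {
        return c >= 0x30 && c <= 0x39;
    }

    // 7-bit ASCII printing characters excluding space (0x20): 0x21..0x7E.
    static boolean isAsciiNonWhitespace(char c) {
        return c >= 0x21 && c <= 0x7E;
    }

    // ISO8859-1: the ASCII range plus 0xA1..0xFF.
    // Assumption: 0xA0 (non-breaking space) is treated as whitespace and
    // the control codes 0x80..0x9F are left out; adjust as needed.
    static boolean isLatin1NonWhitespace(char c) {
        return isAsciiNonWhitespace(c) || (c >= 0xA1 && c <= 0xFF);
    }
}
```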
You can decide for yourself if the character codes above 0x80 are spaces or not (is non-breaking space a space?). You also get to decide about the status of the control characters 0x0..0x1F. Is tab (0x9) a whitespace character? How about CR 0xD and LF 0xA? How about the ETB control character?
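These questions are not hypothetical: even within the Java standard library, two built-in predicates disagree on them. The sketch below wraps both so the difference is easy to see; the class and method names are hypothetical, while Character.isWhitespace and Character.isSpaceChar are the real JDK methods.

```java
// Hypothetical wrapper around two JDK definitions of "space".
class WhitespaceChoices {
    // Character.isWhitespace accepts tab, LF and CR, but deliberately
    // excludes the non-breaking space U+00A0.
    static boolean javaWhitespace(char c) {
        return Character.isWhitespace(c);
    }

    // Character.isSpaceChar follows the Unicode separator categories,
    // so U+00A0 counts, but tab, LF and CR (control characters) do not.
    static boolean unicodeSpace(char c) {
        return Character.isSpaceChar(c);
    }
}
```

Neither predicate treats ETB (0x17) as whitespace, so a grammar that wants it excluded has to say so explicitly.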
Unicode is harder, because it's a huge set, and your list gets big and messy. C'est la vie. Our DMS Software Reengineering Toolkit is used to build parsers for a wide variety of languages, and has to support lexers for ASCII, ISO8859-z for lots of z's, and Unicode. Rather than write complicated "additive" regular expression ranges, DMS allows subtractive regular expressions, so we can write nonwhitespace as the set of all characters minus the whitespace characters, which is much easier to understand and gets it right on the first try.
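The same subtractive idea can be sketched with java.util.regex, whose character classes support intersection with a negated class. This is not DMS syntax; it is only a Java analogue, and the choice to also subtract 0xA0 is an assumption made for the example. The class name SubtractiveRegex is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing set subtraction in a Java character class.
class SubtractiveRegex {
    // Intersection (&&) with a negated class acts as subtraction:
    // everything in 0x00..0xFF except \s and the non-breaking space 0xA0.
    static final Pattern NON_WS =
        Pattern.compile("[\\x00-\\xFF&&[^\\s\\xA0]]+");

    // True when the whole string is one or more nonwhitespace characters.
    static boolean isNonWhitespace(String s) {
        return NON_WS.matcher(s).matches();
    }
}
```

Note that this version keeps the other control characters as nonwhitespace; subtracting them would take one more entry in the negated class.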
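In EBNF I would simply define nonwhitespace as any character that isn't whitespace, something like:
<nonwhitespace> ::= (<anycharacter> - <whitespace>)+
This requires that you have an 'anycharacter' literal that defines the entire range of possible symbols, and a clear definition of which characters are whitespace (here <whitespace> stands for a single whitespace character).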
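In Parboiled you can do this using the TestNot and ANY rules, so for example nonwhitespace would be defined as any character which doesn't match the WhiteSpace() rule, along the lines of OneOrMore(Sequence(TestNot(WhiteSpace()), ANY)), where WhiteSpace() is a rule you define yourself.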
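To show what the negative-lookahead approach does at the character level, here is a small hand-rolled sketch in plain Java. It does not use parboiled; the class MiniPeg and its methods are hypothetical stand-ins that mimic the semantics of a TestNot-style predicate and an ANY rule, and whitespace is assumed to mean Character.isWhitespace.

```java
// Hypothetical hand-rolled matcher that mimics PEG semantics.
// The method names are illustrative and are not parboiled's API.
class MiniPeg {
    private final String input;
    private int pos = 0;

    MiniPeg(String input) {
        this.input = input;
    }

    // Stand-in for a WhiteSpace rule: consumes one whitespace character.
    private boolean whiteSpace() {
        if (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
            return true;
        }
        return false;
    }

    // Negative lookahead: succeeds only if whiteSpace() would fail,
    // and never consumes input.
    private boolean testNotWhiteSpace() {
        int saved = pos;
        boolean matched = whiteSpace();
        pos = saved;
        return !matched;
    }

    // ANY: consumes exactly one character, fails at end of input.
    private boolean any() {
        if (pos < input.length()) {
            pos++;
            return true;
        }
        return false;
    }

    // nonwhitespace <- (!whitespace ANY)+
    boolean nonWhitespace() {
        int start = pos;
        while (testNotWhiteSpace() && any()) {
            // each iteration consumes one nonwhitespace character
        }
        return pos > start;
    }

    int position() {
        return pos;
    }
}
```

The lookahead succeeds at end of input, so it is the ANY step that stops the loop there; this is why the two rules have to be used together.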