如何简化令牌预测 DFA?
Lexer DFA 导致“代码太大”错误
我正在尝试使用 ANTLR 3 解析 Java 服务器页面。Java
对单个方法的字节代码有 64k 的限制,并且我不断遇到“代码太大”错误当编译 ANTLR 生成的 Java 源代码时。
在某些情况下,我可以通过破坏我的词法分析器来修复它。例如,JSP 使用 XML“Name”标记,它可以包含多种字符。我决定在我的“Name”标记中只接受 ASCII 字符,这大大简化了词法分析器中的一些测试,使其能够编译。
然而,我已经到了不能再走捷径的地步了,但 DFA 仍然太复杂了。
我该怎么办?
是否存在导致复杂 DFA 的常见错误?
有没有办法抑制 DFA 的生成,也许依靠语义谓词或固定前瞻来帮助预测?
手动编写这个词法分析器很容易,但在放弃 ANTLR 之前,我想确保我没有忽略一些明显的东西。
背景
ANTLR 3 词法分析器使用 DFA 来决定如何对输入进行标记。在生成的 DFA 中,有一个名为 specialStateTransition()
的方法。此方法包含一个 switch
语句,其中包含 DFA 中每个状态的情况。在每种情况下,都有一系列 if
语句,每个状态对应一个转换。每个 if
语句的条件测试输入字符以查看它是否与转换匹配。
这些字符测试条件可能非常复杂。它们通常具有以下形式:
int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
…
case 13 :
if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
s = 24; /* If the character matches, move to the next state. */
else if …
对我的词法分析器看似微小的更改可能会导致对单个转换进行数十次比较,对每个状态进行多次转换以及数十个状态。我认为由于我的语义谓词,某些正在考虑的状态是不可能达到的,但 DFA 似乎忽略了语义谓词。 (不过我可能会误读——这段代码绝对不是我能手写的!)
我在 Jsp2x 工具中找到了 ANTLR 2 语法,但我对其解析树不满意,我想要为了刷新我的 ANTLR 技能,所以我想我应该尝试编写自己的。我正在使用 ANTLRWorks,并尝试为 DFA 生成图形,但 ANTLRWorks 中似乎存在阻止它的错误。
Lexer DFA results in "code too large" error
I'm trying to parse Java Server Pages using ANTLR 3.
Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.
In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.
However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.
What should I do about it?
Are there common mistakes that result in complex DFAs?
Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?
Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.
Background
ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition()
. This method contains a switch
statement with a case for each state in the DFA. Within each case, there is a series of if
statements, one for each transition from the state. The condition of each if
statement tests an input character to see if it matches the transition.
These character-testing conditions can be very complex. They normally have the following form:
int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
…
case 13 :
if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
s = 24; /* If the character matches, move to the next state. */
else if …
A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)
I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
不幸的是,非常大的语法(许多不同的标记)有这个问题(SQL 语法也遇到这个问题)。
有时,可以通过使某些词法分析器规则
片段
与生成标记的“完整”词法分析器规则相反和/或重新排列规则内字符匹配的方式来解决此问题,但是通过查看您的方式自己已经尝试过了,我怀疑你的情况能有多大收获。但是,如果您愿意在此处发布您的词法分析器语法,那么我或其他人可能会看到一些可以更改的内容。一般来说,这个问题可以通过将词法分析器语法拆分为 2 个或多个单独的词法分析器语法,然后将它们导入到一个“主”语法中来解决。在 ANTLR 术语中,这些被称为复合语法。请参阅有关它们的 ANTLR Wiki 页面:http://www.antlr.org/ wiki/display/ANTLR3/Composite+Grammars
编辑
正如 @Gunther 在 OP 下面的评论中正确提到的,请参阅问答:为什么我的antlr lexer java类“代码太大”?< /a> 其中一个小更改(删除某个谓词)导致此“代码太大”错误消失。
Grammars that are very large (many different tokens) have that problem, unfortunately (SQL grammars suffer from this too).
Sometimes this can be fixed by making certain lexer rules
fragments
opposed to "full" lexer rules that produce tokens and/or re-arranging the way characters are matched inside the rules, but by looking at the way you already tried yourself, I doubt there can gained much in your case. However, if you're willing to post your lexer grammar here on SO, I, or someone else, might see something that could be changed.In general, this problem is fixed by splitting the lexer grammar into 2 or more separate lexer grammars and then importing those in one "master" grammar. In ANTLR terms, these are called composite grammars. See this ANTLR Wiki page about them: http://www.antlr.org/wiki/display/ANTLR3/Composite+Grammars
EDIT
As @Gunther rightfully mentioned in the comment beneath the OP, see the Q&A: Why my antlr lexer java class is "code too large"? where a small change (the removal of a certain predicate) caused this "code too large"-error to disappear.
嗯,实际上,创建复合语法并不总是那么容易。在许多情况下,此 AntTask 会有所帮助解决这个问题(每次重新编译语法后都必须运行,但这个过程并不那么无聊)。
不幸的是,即使这个神奇的脚本在某些复杂的情况下也没有帮助。编译器可能会开始抱怨 DFA 转换(静态 String[] 字段)块太大。
我找到了一种简单的方法来解决这个问题,通过使用任意生成的名称移动(使用 IDE 重构功能)将此类字段移动到另一个类。当以这种方式移动一个或多个字段时,它总是有帮助的。
Well, actually it is not always easy to make a composite grammar. In many cases this AntTask helps to fix this problem (it must be run every time after recompiling a grammar, but this process is not so boring).
Unfortunately, even this magic script doesn't help in some complex cases. Compiler can begin to complaining about too large blocks of DFA transitions (static String[] fields).
I found an easy way to solve it, by moving (using IDE refactoring features) such fields to another class with arbitrarily generated name. It always helps when moving just one or more fields in such way.