当前位置：文江博客话题详情

如何简化令牌预测 DFA？

发布于 2024-12-06 09:05:39 字数 1339 浏览 2 评论 0原文

Lexer DFA 导致“代码太大”错误

我正在尝试使用 ANTLR 3 解析 Java 服务器页面。Java

对单个方法的字节代码有 64k 的限制，并且我不断遇到“代码太大”错误当编译 ANTLR 生成的 Java 源代码时。

在某些情况下，我可以通过破坏我的词法分析器来修复它。例如，JSP 使用 XML“Name”标记，它可以包含多种字符。我决定在我的“Name”标记中只接受 ASCII 字符，这大大简化了词法分析器中的一些测试，使其能够编译。

然而，我已经到了不能再走捷径的地步了，但 DFA 仍然太复杂了。

我该怎么办？

是否存在导致复杂 DFA 的常见错误？

有没有办法抑制 DFA 的生成，也许依靠语义谓词或固定前瞻来帮助预测？

手动编写这个词法分析器很容易，但在放弃 ANTLR 之前，我想确保我没有忽略一些明显的东西。

背景

ANTLR 3 词法分析器使用 DFA 来决定如何对输入进行标记。在生成的 DFA 中，有一个名为 specialStateTransition() 的方法。此方法包含一个 switch 语句，其中包含 DFA 中每个状态的情况。在每种情况下，都有一系列 if 语句，每个状态对应一个转换。每个 if 语句的条件测试输入字符以查看它是否与转换匹配。

这些字符测试条件可能非常复杂。它们通常具有以下形式：

int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
  …
  case 13 :
    if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
      s = 24; /* If the character matches, move to the next state. */
    else if …

对我的词法分析器看似微小的更改可能会导致对单个转换进行数十次比较，对每个状态进行多次转换以及数十个状态。我认为由于我的语义谓词，某些正在考虑的状态是不可能达到的，但 DFA 似乎忽略了语义谓词。（不过我可能会误读——这段代码绝对不是我能手写的！）

我在 Jsp2x 工具中找到了 ANTLR 2 语法，但我对其解析树不满意，我想要为了刷新我的 ANTLR 技能，所以我想我应该尝试编写自己的。我正在使用 ANTLRWorks，并尝试为 DFA 生成图形，但 ANTLRWorks 中似乎存在阻止它的错误。

原文

Lexer DFA results in "code too large" error

I'm trying to parse Java Server Pages using ANTLR 3.

Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.

In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.

However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.

What should I do about it?

Are there common mistakes that result in complex DFAs?

Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?

Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.

Background

ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition(). This method contains a switch statement with a case for each state in the DFA. Within each case, there is a series of if statements, one for each transition from the state. The condition of each if statement tests an input character to see if it matches the transition.

These character-testing conditions can be very complex. They normally have the following form:

int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
  …
  case 13 :
    if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
      s = 24; /* If the character matches, move to the next state. */
    else if …

A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)

I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

耳钉梦 2024-12-13 09:05:40

不幸的是，非常大的语法（许多不同的标记）有这个问题（SQL 语法也遇到这个问题）。

有时，可以通过使某些词法分析器规则片段与生成标记的“完整”词法分析器规则相反和/或重新排列规则内字符匹配的方式来解决此问题，但是通过查看您的方式自己已经尝试过了，我怀疑你的情况能有多大收获。但是，如果您愿意在此处发布您的词法分析器语法，那么我或其他人可能会看到一些可以更改的内容。

一般来说，这个问题可以通过将词法分析器语法拆分为 2 个或多个单独的词法分析器语法，然后将它们导入到一个“主”语法中来解决。在 ANTLR 术语中，这些被称为复合语法。请参阅有关它们的 ANTLR Wiki 页面：http://www.antlr.org/ wiki/display/ANTLR3/Composite+Grammars