转换语法以进行预测解析

发布于 2025-01-22 09:55:55 字数 2040 浏览 3 评论 0 原文

以下语法适合预测解析，还是它们的算法修改语法以使其适合预测解析？

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'

其中*表示零或更多，<代码> | 将不同的选择分开。

我写了一本在上述语法上正常工作的回溯解析器，但是我读到，现代解析器如今主要是预测性解析器，很少使用回溯的解析器。回溯解析器需要倒带解析器的状态，这使得它们的性能少于预测解析器。

将上述语法转换为：

number = digit (sep* digit+)*
sep = '_'
digit = '0'..'9'

其中+表示一个或多个。

将进行预测性解析工作，因为它可以避免 digit_or_sep*在最终 digit 之前消耗了太多令牌。但是我不确定是否有一种算法的方法来自动转换语法以使其起作用。

编辑：我已经阅读了有关Google上的左理由的阅读，这可能是我需要了解的难题的缺失部分。

一点点重构后：

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'
let g = digit_or_sep g | empty
number = digit g digit | digit
let h = g digit | empty
number = digit h
h = g digit | empty
g = digit_or_sep g | empty
g = digit g | sep g | empty
h = (digit g | sep g | empty) digit | empty
h = digit g digit | sep g digit | digit | empty
h = digit h | sep g digit | empty

产生以下语法：

number = digit h
h = digit h | sep g digit | empty
g = digit g | sep g | empty

该语法应适用于预测性解析器。但是我仍然没有想出算法来自动执行此操作。

编辑：投掷更多的详细信息。上面的语法只是我想改变语法的一个小例子。

我实际上是在解析科特林：

它与回溯解析器的工作正常，但是遇到了性能问题。解析简单函数类型签名“（A：String，B：String） - ＆GT; String”需要20ms来解析，这对于如此小的输入而言太慢了。我可以在代码中进行一些优化，但我知道的是比我预期的要慢。

另一方面，进行预测解析，我不能简单地使用他们的语法。看来首先需要对语法进行一些手动更改。在产生解析器之前，Antlr必须对其语法进行一些转换。

在此语法文件的底部，Kotlin的语法文件是上面的数字示例相同的：

我读过ANTLR可以在1秒内解析完整的Java标准库。我可能还有很长的路要走。

编辑：（29/04/2022）在进行了更多研究之后，我发现初始语法（数字）对于预测解析器来说没有问题，我只需要更多的浏览。

原文

Is the following grammar suitable for predictive parsing, or is their an algorithm to modify the grammar to make it suitable for predictive parsing?

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'

where * means zero or more, and | divides the different choices.

I have written a backtracking parser that works fine on the above grammar, however I've read that modern parsers are mostly predictive parsers these days and backtracking parsers are rarely used. Backtracking parsers need to rewind the state of the parser as they backtrack which makes them less performant than predictive parsers.

Transforming the grammar above into:

number = digit (sep* digit+)*
sep = '_'
digit = '0'..'9'

where + means one or more.

Will make predictive parsing work because it avoids digit_or_sep* from consuming too many tokens before the final digit. But I am not sure if there is an algorithmic way to auto-transform the grammar to make it work.

Edit: I had a read about left factoring on google, it could be the missing piece of the puzzle I need to understand.

After a bit of left refactoring:

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'
let g = digit_or_sep g | empty
number = digit g digit | digit
let h = g digit | empty
number = digit h
h = g digit | empty
g = digit_or_sep g | empty
g = digit g | sep g | empty
h = (digit g | sep g | empty) digit | empty
h = digit g digit | sep g digit | digit | empty
h = digit h | sep g digit | empty

The following grammar is produced:

number = digit h
h = digit h | sep g digit | empty
g = digit g | sep g | empty

Which should be suitable for a predictive parser. But I have still not come up with an algorithm to do it yet automatically.

Edit: Throwing in more details of what I am trying to do. The grammar above is just an small example of grammar I'd like to transform.

I am actually parsing Kotlin:
https://github.com/clinuxrulz/parse-bolt/blob/main/src/kotlin/parser.rs

And it is working fine with the backtracking parser, but it is having performance issues. Parsing a simple function type signature "(a: String, b: String) -> String" took 20ms to parse which is way too slow for such a small input. There are some optimisations I can do in the code that I know off, but it was way slower than I expected.

On the other hand for predictive parsing I can not simply use their grammar as is. It seems it will need some manual changes to the grammar first. ANTLR must be doing some transformations on their grammar 1st before generating the parser.

Down the bottom of this grammar file for Kotlin is the same the number example above:
https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinLexer.g4

I've read ANTLR can parse the full Java standard library in under 1 second. I might have a long way to go.

Edit: (29/04/2022) After more research, I found that the initial grammar (for number) has no problems for a predictive parser, I just needed more look-ahead.