How do I build a clean, Python-like grammar in ANTLR?

Posted 2024-07-29 10:52:03

G'day!

How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?

I'm trying to write a simple DSL for expressions:

# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)

Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:

# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
                               +AnotherValueWithAGratuitouslyLongName)

I started off with an ANTLR grammar like this:

exprlist
    : ( assignment_statement | empty_line )* EOF!
    ;
assignment_statement
    : assignment NL!?
    ;
empty_line
    : NL;
assignment
    : ID '=' expr
    ;

// ... and so on

It seems simple, but I'm already in trouble with the newlines:

warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

Graphically, in org.antlr.works.IDE:

Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png

I've kicked the grammar around, but always end up with violations of expected behavior:

  • A newline is not required at the end of the file
  • Empty lines are acceptable
  • Everything in a line from a pound sign onward is discarded as a comment
  • Assignments end with end-of-line, not semicolons
  • Expressions can span multiple lines if wrapped in brackets

I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.

Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?

Comments (3)

巴黎盛开的樱花 2024-08-05 10:52:03

I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:

  • Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.

  • Always generate an NL token at the end of the file, whether or not the last line ends with a '\n' character, so you don't have to worry about the special case of a statement without an NL. Statements always end with an NL.
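
The two tokenizer rules above can be sketched outside ANTLR as a small hand-written lexer. This is only an illustration in Python, not the answerer's code and not ANTLR's API: the token kinds (ID, NUM, OP, OPEN, CLOSE) and the function name tokenize are assumptions made for the example, and it guarantees that the stream ends with an NL rather than appending one unconditionally.

```python
import re

# Hypothetical token kinds for this sketch; only NL comes from the grammar.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)     # sh-style comment, discarded
  | (?P<NL>\r?\n)             # end of line
  | (?P<WS>[ \t]+)            # horizontal whitespace, discarded
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<NUM>\d+)
  | (?P<OPEN>[(\[{])
  | (?P<CLOSE>[)\]}])
  | (?P<OP>[=+\-*/,])
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, value) pairs, hiding newlines inside open brackets."""
    depth = 0      # number of currently unclosed (, [ or {
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind in ("COMMENT", "WS"):
            continue                      # never reaches the parser
        if kind == "NL":
            if depth > 0:
                continue                  # inside brackets: the line continues
            tokens.append(("NL", "\n"))
            continue
        if kind == "OPEN":
            depth += 1
        elif kind == "CLOSE":
            depth = max(depth - 1, 0)     # stray closers are left to the parser
        tokens.append((kind, m.group()))
    # Make sure the stream ends with NL even if the file does not.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens
```

With a token stream like this, the parser never sees a newline inside parentheses, and every statement is followed by an NL, including the last one.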

The second point would let you simplify your grammar to something like this:

exprlist
    : ( assignment_statement | empty_line )* EOF!
    ;
assignment_statement
    : assignment NL
    ;
empty_line
    : NL
    ;
assignment
    : ID '=' expr
    ;

一绘本一梦想 2024-08-05 10:52:03

How about this?

exprlist
    : (expr)? (NL+ expr)* NL!? EOF!
    ;
expr 
    : assignment | ...
    ;
assignment
    : ID '=' expr
    ;

走过海棠暮 2024-08-05 10:52:03

I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.

While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.

In your case, the parser doesn't know whether it should parse "assignment NL" or "assignment empty_line". There are many ways to solve this, but most of them are just band-aids for an unwise design choice.

My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!

It may seem a little unsavory, but in reality it will save you a lot of future headaches.
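
As a rough illustration of why the appended NL helps, here is a minimal Python sketch that works on raw text rather than on an ANTLR input stream. The helper names with_trailing_newline and split_statements are hypothetical, and the sketch deliberately ignores comments and bracketed multi-line expressions.

```python
def with_trailing_newline(source: str) -> str:
    """Return the source with one newline appended unconditionally."""
    return source + "\n"


def split_statements(source: str) -> list[str]:
    """Split text into statements, each of which is terminated by a newline."""
    terminated = with_trailing_newline(source)
    # After the append, the piece following the final newline is always empty,
    # so every remaining piece is a complete, NL-terminated line.
    lines = terminated.split("\n")[:-1]
    # Blank lines correspond to the grammar's empty_line rule and are skipped.
    return [line for line in lines if line.strip()]
```

Because the terminator is always present, the code needs no special case for a final statement that lacks a newline; an extra trailing NL simply shows up as one more empty line.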
