Manually emitting a token with ANTLR

Posted on 2024-08-17 18:18:34

I'm having a bit of trouble manually emitting a token with a lexer rule in ANTLR. I know that the emit() function needs to be used but there seems to be a distinct lack of documentation about this. Does anybody have a good example of how to do this?

The ANTLR book gives a good example of how you need to do this to parse Python's nesting. For example, if you see a certain amount of whitespace that's greater than the previous line's whitespace, emit an INDENT token but if it's less, emit a DEDENT token. Unfortunately the book glosses over the actual syntax that's required.
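For reference, the INDENT/DEDENT mechanism described above can be modelled outside ANTLR. The sketch below is plain Python, not ANTLR's API: the class `IndentLexer`, the `pending` queue, and the token names `LINE` and `EOF` are hypothetical and exist only for illustration. The point it tries to show is that one input line may produce several tokens (for example two DEDENTs), so `emit` appends to a queue and `next_token` hands the queued tokens out one at a time.

```python
from collections import deque


class IndentLexer:
    """Toy line-based lexer that can emit several tokens per input line."""

    def __init__(self, lines):
        self.lines = iter(lines)
        self.stack = [0]        # indentation widths of the currently open blocks
        self.pending = deque()  # tokens emitted but not yet handed to the caller

    def emit(self, kind, text=""):
        # Queue the token instead of returning it, so one line can yield many.
        self.pending.append((kind, text))

    def _scan_line(self, line):
        body = line.lstrip(" ")
        if not body:
            return  # blank lines leave the nesting unchanged
        width = len(line) - len(body)
        if width > self.stack[-1]:
            self.stack.append(width)
            self.emit("INDENT")
        # Assumption: a dedent always returns to an earlier width;
        # inconsistent indentation is not reported in this sketch.
        while width < self.stack[-1]:
            self.stack.pop()
            self.emit("DEDENT")
        self.emit("LINE", body)

    def next_token(self):
        # Refill the queue until there is something to return.
        while not self.pending:
            line = next(self.lines, None)
            if line is None:
                # End of input: close every block that is still open.
                while len(self.stack) > 1:
                    self.stack.pop()
                    self.emit("DEDENT")
                self.emit("EOF")
                break
            self._scan_line(line)
        return self.pending.popleft()
```

A caller would simply loop on `next_token()` until it sees `EOF`. In an ANTLR lexer the same idea would presumably live in the lexer's members section, but the exact hooks depend on the ANTLR version.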

EDIT: Here's an example of what I'm trying to parse. It's Markdown's nested blockquotes:

before blockquote

> text1
>
> > text2
>
> text3

outside blockquote

Now, my approach so far is essentially to count the > symbols on each line. For example, the above seems like it should emit (roughly...) PARAGRAPH_START, CDATA, PARAGRAPH_END, BQUOTE_START, CDATA, BQUOTE_START, CDATA, BQUOTE_END, CDATA, BQUOTE_END, PARAGRAPH_START, CDATA, PARAGRAPH_END. The difficulty here is the final BQUOTE_END, which I think should be an imaginary token emitted once a non-blockquote element is found while the nesting level is >= 1.
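To make the counting approach concrete, here is a minimal Python sketch of it, independent of ANTLR. The function name `blockquote_tokens` and the regular expression are assumptions made for illustration. It also simplifies heavily: every top-level text line becomes its own paragraph, and blank lines and bare > lines are skipped. The imaginary BQUOTE_END tokens come out whenever a line with text arrives at a shallower nesting level, and once more at end of input.

```python
import re

# Zero or more '>' markers, each optionally preceded by whitespace.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)*")


def blockquote_tokens(lines):
    """Yield (type, text) pairs for lines of Markdown-style nested blockquotes."""
    depth = 0  # number of blockquotes currently open
    for line in lines:
        prefix = QUOTE_PREFIX.match(line).group(0)
        text = line[len(prefix):].strip()
        if not text:
            continue  # blank line or bare '>' line: no tokens in this sketch
        level = prefix.count(">")
        while depth < level:
            depth += 1
            yield ("BQUOTE_START", "")
        while depth > level:
            depth -= 1
            yield ("BQUOTE_END", "")  # imaginary token: no input text behind it
        if level == 0:
            yield ("PARAGRAPH_START", "")
            yield ("CDATA", text)
            yield ("PARAGRAPH_END", "")
        else:
            yield ("CDATA", text)
    # Close any blockquotes still open at end of input.
    while depth > 0:
        depth -= 1
        yield ("BQUOTE_END", "")
```

Run on the sample above, this yields the same token sequence as listed in the question. Real Markdown has more cases (lazy continuation lines, paragraphs inside quotes) that this sketch does not attempt.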

Comments (2)

流星番茄 2024-08-24 18:18:34

Well, if the token you want to emit is not defined by a lexer rule, then you'll need to add a tokens section like so:

tokens
{
    MYFAKETOKEN
}

In your lexer you will still need a rule that tells the lexer when to produce this token. A common instance is determining whether something is an integer, a range, or a real value.

NUMBERS_OR_RANGE
    : INT
        // "1..5": the INT stands alone and the ".." is left for the range
        ( { LA(1) == '.' && LA(2) == '.' }? { _ttype = INT; }
        | { LA(1) == '.' || LA(1) == 'e' || LA(1) == 'E' }? { _ttype = REAL; }
        )
    | PERIOD
        // ".." is a range; "." followed by digits is a real
        ( PERIOD { _ttype = RANGE; }
        | INT (( 'e' | 'E' ) ( '-' | '+' )? INT )? { _ttype = REAL; }
        )
    ;

Here you can see that in the first case we match an INT and then look ahead. If we find a double period, we know that the INT really is an int and not a real, so we set the variable _ttype to INT. If we instead find a single period, or an 'e' or 'E', we know it's a real.

In the second case we match a period first. If the next char is also a period, we've got a range; otherwise we've got a real.

We could use the MYFAKETOKEN type we defined above to assign to _ttype if that was appropriate.
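The lookahead decision in that rule can be tried out without ANTLR. The following Python sketch is only a model of the logic, with hypothetical names (`scan_number_or_range`, `DIGITS`, `REAL_TAIL`, `LEADING_DOT_REAL`). Unlike the grammar above, it also consumes the fractional part and exponent of a real, which is an assumption about what a complete rule would do.

```python
import re

DIGITS = re.compile(r"\d+")
# Optional fraction and optional exponent following the integer part.
REAL_TAIL = re.compile(r"(?:\.\d*)?(?:[eE][+-]?\d+)?")
# A real written with a leading period, such as ".5" or ".5e-3".
LEADING_DOT_REAL = re.compile(r"\.\d+(?:[eE][+-]?\d+)?")


def scan_number_or_range(text, pos=0):
    """Return (token_type, lexeme) for the token starting at text[pos]."""
    if text.startswith("..", pos):
        return ("RANGE", "..")
    if text.startswith(".", pos):
        m = LEADING_DOT_REAL.match(text, pos)
        if m is None:
            raise ValueError("expected digits after '.'")
        return ("REAL", m.group(0))
    m = DIGITS.match(text, pos)
    if m is None:
        raise ValueError("expected a digit or '.'")
    end = m.end()
    if text.startswith("..", end):
        # Two periods ahead: the digits are an INT and ".." is a separate token.
        return ("INT", m.group(0))
    tail = REAL_TAIL.match(text, end)
    if tail.end() > end:
        # A fraction or exponent followed the digits, so the token is a REAL.
        return ("REAL", text[pos:tail.end()])
    return ("INT", m.group(0))
```

For example, scanning "1..5" at position 0 gives an INT for "1", and scanning again at position 1 gives the RANGE token, which is the same split the predicates in the grammar are meant to achieve.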

作死小能手 2024-08-24 18:18:34

Okay, I did some research and found this: http://www.cforcoding.com/2010/01/markdown-and-introduction-to-parsing.html

I don't think ANTLR is really set up for this sort of task and trying to bend over backwards to do it isn't really worth it.
