Where should I draw the line between lexer and parser?

Posted 2024-10-25 07:35:09


I'm writing a lexer for the IMAP protocol for educational purposes and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:

* FLAGS (\Answered \Deleted)

This response is defined in the formal syntax like this:

mailbox-data   = "FLAGS" SP flag-list
flag-list      = "(" [flag *(SP flag)] ")"
flag           = "\Answered" / "\Deleted"

Since they are specified as string literals (aka "terminal" tokens) would it be more correct for the lexer to emit a unique token for each, like:

(TknAnsweredFlag)
(TknSpace)
(TknDeletedFlag)

Or would it be just as correct to emit something like this:

(TknBackSlash)
(TknString "Answered")
(TknSpace)
(TknBackSlash)
(TknString "Deleted")

My confusion is that the former method could overcomplicate the lexer - if \Answered had two meanings in two different contexts the lexer wouldn't emit the right token. As a contrived example (this situation won't occur because e-mail addresses are enclosed in quotes), how would the lexer deal with an e-mail address like \[email protected]? Or is the formal syntax designed to never allow such an ambiguity to arise?


Comments (3)

没企图 2024-11-01 07:35:10


I'd recommend avoiding a separate lexer and parser altogether: modern parsing approaches (like PEGs) allow you to mix lexing and parsing. That way you won't need tokens at all.
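A minimal sketch of this scannerless idea in Python, parsing the flag-list rule directly from characters with no token stream in between (all function names here are illustrative, not from any IMAP library):

```python
def parse_flag_list(text, pos=0):
    """Parse '(' [flag *(SP flag)] ')' starting at pos, PEG-style:
    the grammar rule consumes raw characters, so no lexer is needed.
    Returns (flags, new_pos) or raises ValueError."""
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    flags = []
    while pos < len(text) and text[pos] == "\\":
        start = pos
        pos += 1
        while pos < len(text) and text[pos].isalpha():
            pos += 1
        flags.append(text[start:pos])
        if pos < len(text) and text[pos] == " ":
            pos += 1  # SP between flags
    if pos >= len(text) or text[pos] != ")":
        raise ValueError("expected ')'")
    return flags, pos + 1

flags, _ = parse_flag_list(r"(\Answered \Deleted)")
print(flags)  # ['\\Answered', '\\Deleted']
```

Note how the "where does the lexer end" question dissolves: `parse_flag_list` is simultaneously the tokenizer for flags and the grammar rule for flag-list.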

沫离伤花 2024-11-01 07:35:09


As a general rule, you don't want lexical syntax to propagate into the grammar, because it's just detail. For instance, a lexer for a computer programming language like C would certainly recognize numbers, but it is generally inappropriate to produce HEXNUMBER and DECIMALNUMBER tokens, because the distinction isn't important to the grammar.

I think what you want are the most abstract tokens that still allow your grammar to distinguish the cases of interest relative to your purpose. You get to weigh the confusion caused in one part of the grammar against the choices you might make in other parts.

If your goal is simply to read past the flag values, then in fact you don't need to distinguish among them, and a TknFlag with no associated content would be good enough.

If your goal is to process the flag values individually, you need to know whether you got an ANSWERED and/or DELETED indication. How they are lexically spelled is irrelevant, so I'd go with your TknAnsweredFlag solution. I would drop TknSpace, because in any sequence of flags there must be intervening spaces (your spec says so), so I'd try to eliminate them using whatever whitespace-suppression machinery your lexer offers.
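This approach might look like the following sketch: one regex alternative per flag, and a skip rule so space tokens never reach the parser (token names are my own, not from the question's spec):

```python
import re

# One named group per token; whitespace is matched but never emitted,
# so the parser only ever sees the tokens that carry meaning.
TOKEN_SPEC = [
    ("TknAnsweredFlag", r"\\Answered"),
    ("TknDeletedFlag",  r"\\Deleted"),
    ("TknLParen",       r"\("),
    ("TknRParen",       r"\)"),
    ("SKIP",            r"[ \t]+"),   # whitespace suppression
]
MASTER_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    tokens = []
    for m in MASTER_RE.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append(m.lastgroup)
    return tokens

print(lex(r"(\Answered \Deleted)"))
# ['TknLParen', 'TknAnsweredFlag', 'TknDeletedFlag', 'TknRParen']
```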

On occasion I run into situations where there are dozens of such flag-like things; then your grammar starts to become cluttered if you have a token for each. If the grammar doesn't need to know the specific flags, you should have a TknFlag with an associated string value. If a small subset of the flags is needed by the grammar to discriminate, but most of them are not, then you should compromise: have individual tokens for the flags that matter to the grammar, and a catch-all TknFlag with an associated string for the rest.
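The compromise can be sketched as a lookup table over the flags the grammar cares about, with everything else collapsing into the catch-all (names are hypothetical):

```python
import re

# Distinct tokens only for the flags the grammar distinguishes;
# everything else becomes a generic TknFlag carrying its spelling.
SPECIAL = {"\\Answered": "TknAnsweredFlag", "\\Deleted": "TknDeletedFlag"}
FLAG_RE = re.compile(r"\\[A-Za-z]+")

def lex_flags(text):
    tokens = []
    for m in FLAG_RE.finditer(text):
        word = m.group()
        if word in SPECIAL:
            tokens.append((SPECIAL[word],))          # grammar-significant
        else:
            tokens.append(("TknFlag", word))          # catch-all with value
    return tokens

print(lex_flags(r"(\Answered \Seen \Draft \Deleted)"))
# [('TknAnsweredFlag',), ('TknFlag', '\\Seen'),
#  ('TknFlag', '\\Draft'), ('TknDeletedFlag',)]
```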

Regarding the difficulty of having two different interpretations: this is one of those tradeoffs. If you hit that issue, your tokens need fine enough detail in both places where they are used in the grammar so that you can discriminate. If "\" is relevant as a token somewhere else in the grammar, you certainly could produce both TknBackSlash and TknAnswered. However, if something is treated differently in one part of the grammar than in another, you can often get around this with a mode-driven lexer. Think of the modes as a finite state machine, each state with an associated (sub)lexer. Transitions between modes are triggered by tokens that act as cues (you must have a FLAGS token; it is precisely such a cue that you are about to pick up flag values). In a mode, you can produce tokens that other modes would not produce; thus in one mode you might produce "\" tokens, but in your flag mode you wouldn't need to. Mode support is pretty common in lexers because this problem is more common than you might expect. See the Flex documentation on start conditions for an example.
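The mode-driven idea can be sketched like this: the lexer is a small state machine, and the FLAGS token is the cue that switches it into a mode where a backslash-word becomes a single flag token (all names are illustrative; error handling is omitted for brevity):

```python
import re

def lex_modal(text):
    mode = "DEFAULT"
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == " ":
            pos += 1
            continue
        if mode == "DEFAULT":
            m = re.match(r"[A-Z]+|\*", text[pos:])
            word = m.group()
            tokens.append(("TknAtom", word))
            if word == "FLAGS":
                mode = "FLAGLIST"   # cue: entering the flag list
            pos += len(word)
        else:  # FLAGLIST mode: a backslash starts a whole flag token
            if ch == "(":
                tokens.append(("TknLParen",)); pos += 1
            elif ch == ")":
                tokens.append(("TknRParen",)); pos += 1
                mode = "DEFAULT"    # cue: flag list is over
            else:
                m = re.match(r"\\[A-Za-z]+", text[pos:])
                tokens.append(("TknFlag", m.group()))
                pos += len(m.group())
    return tokens

print(lex_modal(r"* FLAGS (\Answered \Deleted)"))
```

In a mode where "\" is meaningful on its own, the FLAGLIST branch would instead emit a TknBackSlash token; the point is that each mode gets its own token rules.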

The fact that you are asking the question shows you are on the right track to making a good choice. You need to balance the maintainability goal of minimizing tokens (technically you could parse using a token for every ASCII character!) against the fundamental requirement to discriminate well enough for your needs. After you've built a dozen grammars this tradeoff will seem easy, but I think the rules of thumb I've provided are pretty good.

若无相欠,怎会相见 2024-11-01 07:35:09


I'd come up with the CFG first; whatever terminals it needs to do its job are what the lexer should recognize. Otherwise you are just guessing at the proper way to tokenize the string.
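Applied to the grammar in the question, this grammar-first approach means reading the token set straight off the ABNF string literals. A sketch (token names and the longest-match strategy are my own assumptions):

```python
# Terminals read directly from the question's ABNF:
#   mailbox-data = "FLAGS" SP flag-list
#   flag-list    = "(" [flag *(SP flag)] ")"
#   flag         = "\Answered" / "\Deleted"
TERMINALS = {
    "FLAGS":      "TknFlags",
    "(":          "TknLParen",
    ")":          "TknRParen",
    "\\Answered": "TknAnsweredFlag",
    "\\Deleted":  "TknDeletedFlag",
}

def lex_by_terminals(text):
    """Greedy longest-match over the literal terminals; SP is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        for lit in sorted(TERMINALS, key=len, reverse=True):
            if text.startswith(lit, pos):
                tokens.append(TERMINALS[lit])
                pos += len(lit)
                break
        else:
            raise ValueError(f"no terminal matches at position {pos}")
    return tokens

print(lex_by_terminals(r"FLAGS (\Answered \Deleted)"))
# ['TknFlags', 'TknLParen', 'TknAnsweredFlag', 'TknDeletedFlag', 'TknRParen']
```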
