How should this be tokenized?
I am trying to hand-code a tokenizer. I keep reading characters that can be part of the current token; for example, an integer can only contain digits. So in the text below I keep reading characters until I find a non-digit character, which gives me 123 as a token. Next I get ( as a token, and then abc as an identifier. This is fine, since ( is a delimiter.
123(abc
However, in the text below I get 123 as an integer and then abc as an identifier. But this is actually not valid, since there is no delimiter between them.
123abc(
Should the tokenizer check for delimiters and report an error? If so, what tokens should be returned, and where should the tokenizer continue reading after an invalid token is found?
Or should the tokenizer simply return 123 as an integer and abc as an identifier, and let the parser detect the error?
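For concreteness, here is a minimal sketch in Python of the maximal-munch loop described above (Python and the token shapes are my own illustration, not part of the question). It reproduces the behavior at issue: it happily splits 123abc( into an integer and an identifier.

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isdigit():
            # Keep reading characters while they can be part of an integer.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("INT", text[i:j]))
            i = j
        elif c.isalpha():
            # Keep reading characters while they can be part of an identifier.
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(("IDENT", text[i:j]))
            i = j
        elif c == "(":
            tokens.append(("LPAREN", c))
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r} at {i}")
    return tokens

print(tokenize("123(abc"))  # [('INT', '123'), ('LPAREN', '('), ('IDENT', 'abc')]
print(tokenize("123abc("))  # [('INT', '123'), ('IDENT', 'abc'), ('LPAREN', '(')]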
4 Answers
Usually, the tokenizer (or lexer) performs no checking of valid syntax.
The role of a lexer is to split the input into tokens, which can then be transformed into a syntax tree by the parser. Therefore, it'd usually be the job of the parser to perform such a check.
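As a rough illustration of that division of labor (the token tuples and the adjacency rule below are invented stand-ins for a real grammar), the lexer emits 123 and abc as two well-formed tokens, and the parser rejects the stream because no rule allows them side by side:

tokens = [("INT", "123"), ("IDENT", "abc"), ("LPAREN", "(")]

def parse(tokens):
    # A real parser would apply its grammar rules; here a single
    # adjacency check stands in for "no production allows INT IDENT".
    for (kind, text), (next_kind, next_text) in zip(tokens, tokens[1:]):
        if kind == "INT" and next_kind == "IDENT":
            raise SyntaxError(f"number {text!r} cannot be immediately "
                              f"followed by identifier {next_text!r}")

try:
    parse(tokens)  # the token stream produced for "123abc("
except SyntaxError as e:
    print("parse error:", e)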
This is somewhat of a gray area, but most hand-coded lexers just do the tokenizing and let the parser decide whether the stream of tokens makes sense.
If "123abc" is an invalid token then you should handle it as soon as you spot it since it's directly related to the way tokens are defined, not how they interact with each other (which would be the lexer's job). It's an orthographic error rather than a grammar-related one.
There are multiple ways to go about it:
You could abort the parsing and just throw some exception, leaving the caller with no tokens, or with just the tokens you had successfully parsed until then. This will save you any "recovery" logic and might be enough for your use case. If you're parsing for syntax highlighting, for instance, this would probably not be sufficient, as you don't want all of the remaining code to look unparsed.
Example: A conforming XML parser could use this for fatal errors if there's no need to handle malformed markup; it just spits out a basic error and quits.
Alternatively, you could insert an "error" token with proper metadata about the nature of the error and skip ahead to the next valid token.
You might need heuristics in your lexer to handle the error token gracefully and to decide how to interpret further tokens when an error token is found inside a nested expression (e.g., should you consider the expression to have ended? Look for a closing token? etc.).
Anyway, this approach will allow for error tokens to be used to display precise info about the location and nature of errors encountered (think inline error reporting in a GUI).
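Here is a hedged sketch of that error-token idea (the names and the recovery policy are assumptions, not a canonical implementation): when a digit run is immediately followed by letters, the lexer emits a single ERROR token carrying the offending text and its position, then resumes at the next character that can start a valid token. The abort option described earlier would simply raise at the same spot instead.

def tokenize_with_recovery(text):
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            if j < len(text) and text[j].isalpha():
                # Malformed "123abc": consume the whole run as one ERROR
                # token (the abort variant would raise here instead).
                k = j
                while k < len(text) and text[k].isalnum():
                    k += 1
                tokens.append(("ERROR", text[i:k], i))
                i = k  # resume at the next valid token boundary
            else:
                tokens.append(("INT", text[i:j], i))
                i = j
        elif c.isalpha():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(("IDENT", text[i:j], i))
            i = j
        elif c == "(":
            tokens.append(("LPAREN", c, i))
            i += 1
        else:
            # Unknown character: also surfaced as an ERROR token.
            tokens.append(("ERROR", c, i))
            i += 1
    return tokens

print(tokenize_with_recovery("123abc("))
# [('ERROR', '123abc', 0), ('LPAREN', '(', 6)]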
You might consider generating your tokenizer or lexer; tools like Flex or ANTLR should help. You might also generate your parser with ANTLR or Bison.
If you insist on hand-coding your lexer (and your parser), having some look-ahead is extremely helpful in practice. For instance, you could read your input line by line and tokenize inside the current line (with the ability to inspect the next few characters).
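A rough sketch of that style, under the assumption of a file-like input read line by line (the helper names are invented):

import io

class Lookahead:
    # Wraps one line of input and exposes peek() so the lexer can
    # inspect upcoming characters without consuming them.
    def __init__(self, line):
        self.line = line
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.line[i] if i < len(self.line) else ""

    def advance(self):
        c = self.peek()
        self.pos += 1
        return c

def lex_file(stream):
    for lineno, line in enumerate(stream, start=1):
        la = Lookahead(line.rstrip("\n"))
        while la.peek():
            if la.peek().isdigit() and la.peek(1).isalpha():
                # One character of look-ahead catches "123abc" early.
                print(f"line {lineno}: a number runs directly into a letter")
            la.advance()

lex_file(io.StringIO("123abc(\n"))  # prints the diagnostic for line 1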