How do lexical analyzers handle comments and escape sequences?
Comments and escape sequences (such as those in string literals) look very different from ordinary tokens.
It's hard for me to understand how regular lexical analyzers tokenize them.
How do lexical analyzers such as lex and flex handle these constructs? Is there a generic method, or is it handled case by case for each language?
Comments (3)
I think "case by case for each language" is right.
If a comment starter appears inside a string literal, the lexer has to ignore it.
Similarly, in C, if an escaped double quote `\"` appears inside a string literal, the lexer must not treat it as the end of the string.
For this purpose, flex has start conditions, which enable context-dependent analysis.
For instance, the flex texinfo manual has an example of scanning C comments (the text between `/*` and `*/`) with start conditions.
Start conditions also enable string literal analysis.
The Start Conditions section of the manual has an example of how to match C-style quoted strings using start conditions, and there is also an FAQ item titled "How do I expand backslash-escape sequences in C-style quoted strings?" in the same manual.
That will probably answer your question about string literals directly.
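As a rough illustration of what start conditions provide, here is a minimal Python sketch. It is not flex code and not the example from the manual. It emulates three exclusive states, each with its own rule set, so that `/*` inside a string and `"` inside a comment are not misread. The state names (`INITIAL`, `COMMENT`, `STRING`), the token names, and the `tokenize` helper are all hypothetical, and unlike flex it picks the first matching rule rather than the longest match.

```python
import re

# Per-state rules, loosely mimicking flex exclusive start conditions.
# State, action and token names are made up for this sketch.
RULES = {
    "INITIAL": [
        (re.compile(r"/\*"), "enter_comment"),
        (re.compile(r'"'), "enter_string"),
        (re.compile(r"\s+"), "skip"),
        (re.compile(r'[^\s"/]+|/'), "word"),
    ],
    "COMMENT": [
        (re.compile(r"\*/"), "leave"),
        (re.compile(r"[^*]+|\*"), "skip"),      # a '"' here is just comment text
    ],
    "STRING": [
        (re.compile(r'"'), "leave_string"),
        (re.compile(r"\\(.)", re.S), "escape"),  # backslash plus any one character
        (re.compile(r'[^"\\]+'), "chars"),       # a '/*' here is just string text
    ],
}

# Escapes this sketch expands; any other escaped character is kept as-is.
ESCAPES = {"n": "\n", "r": "\r", "\\": "\\", '"': '"'}


def tokenize(text):
    """Return a list of (kind, value) tokens, dropping comments and whitespace."""
    state, pos, buf, tokens = "INITIAL", 0, [], []
    while pos < len(text):
        # First rule of the current state that matches at pos wins.
        for pattern, action in RULES[state]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"unexpected input at offset {pos} in state {state}")
        pos = m.end()
        if action == "enter_comment":
            state = "COMMENT"
        elif action == "enter_string":
            state, buf = "STRING", []
        elif action == "leave":
            state = "INITIAL"
        elif action == "leave_string":
            tokens.append(("STRING", "".join(buf)))
            state = "INITIAL"
        elif action == "escape":
            buf.append(ESCAPES.get(m.group(1), m.group(1)))
        elif action == "chars":
            buf.append(m.group())
        elif action == "word":
            tokens.append(("WORD", m.group()))
        # "skip" simply discards the matched text
    if state != "INITIAL":
        raise SyntaxError(f"unterminated {state.lower()}")
    return tokens
```

For example, `tokenize('x = "a /* b */ \\" c" /* "not a string" */ y')` yields a single `STRING` token for the quoted part and drops the comment entirely.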
I'm not sure what you mean, but this statement is certainly wrong. Both comments (unless they may be nested) and strings with escape sequences admit a simple regular language description.
For example, an escape sequence allowing `\\`, `\"`, `\n` and `\r` can be described by a regular grammar (with start symbol `E`).
And a string is just a repetition of zero or more unescaped symbols or escape sequences (i.e. a Kleene closure over the union of two regular languages, which is itself regular).
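To make the regularity claim concrete, here is a small Python sketch that writes the same idea as plain regular expressions. The names `ESCAPE`, `PLAIN`, `STRING` and `C_COMMENT` are my own, and the choice to forbid raw newlines inside a string is an assumption for illustration, not something stated above.

```python
import re

# Escape sequence: a backslash followed by one of  \  "  n  r
ESCAPE = r'\\[\\"nr]'
# Unescaped symbol: anything except a quote, a backslash or a newline
# (excluding newlines is an assumption made for this sketch).
PLAIN = r'[^"\\\n]'
# String literal: quote, Kleene closure over (PLAIN | ESCAPE), quote.
STRING = re.compile(rf'"(?:{PLAIN}|{ESCAPE})*"')

# Non-nested C comment, also a regular language: "/*", then text that
# never contains "*/", then the closing "*/".
C_COMMENT = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')
```

With these patterns, `STRING.fullmatch(r'"a\"b"')` succeeds while `STRING.fullmatch(r'"bad\q"')` returns `None`, because `\q` is not one of the allowed escapes. Nested comments would need a counter, which is exactly the case the answer excludes.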
I can't say anything about lex, but in the lexer for my language (which uses C++-style `//` comments) I have already split the input into lines (since it's a Python-inspired language), so I just have a regex that matches `//` followed by any number of any characters.
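A minimal sketch of that line-based approach might look like the following. The function name `strip_line_comments` is hypothetical, and this is a guess at the shape of such a lexer step rather than the actual code. Note that, as written, it would also cut a `//` that appears inside a string literal, so a real lexer would have to recognize strings first.

```python
import re

# "//" followed by any number of any characters up to the end of the line.
LINE_COMMENT = re.compile(r"//.*")


def strip_line_comments(source):
    """Split source into lines and remove a trailing // comment from each."""
    return [LINE_COMMENT.sub("", line).rstrip() for line in source.splitlines()]
```

Because the input is split into lines first, `.` never has to cross a newline, which is why the pattern can stay this simple.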