When parsing JavaScript, what determines the meaning of a slash?
JavaScript has a tricky grammar to parse. Forward slashes can mean a number of different things: division operator, regular expression literal, comment introducer, or line-comment introducer. The last two are easy to distinguish: if the slash is followed by a star, it starts a multiline comment. If the slash is followed by another slash, it starts a line comment.
But the rules for disambiguating division and regex literals are escaping me. I can't find them in the ECMAScript standard. There the lexical grammar is explicitly divided into two parts, InputElementDiv and InputElementRegExp, depending on what a slash will mean. But there's nothing explaining when to use which.
And of course the dreaded semicolon insertion rules complicate everything.
Does anyone have an example of clear code for lexing JavaScript that has the answer?
It's actually fairly easy, but it requires making your lexer a little smarter than usual.
The division operator must follow an expression, and a regular expression literal can't follow an expression, so in all other cases you can safely assume you're looking at a regular expression literal.
You already have to identify Punctuators as multiple-character strings, if you're doing it right. So look at the previous token, and see if it's any of these:
For most of these, you now know you're in a context where you can find a regular expression literal. Now, in the case of `++` and `--`, you'll need to do some extra work. If the `++` or `--` is a pre-increment/decrement, then the `/` following it starts a regular expression literal; if it is a post-increment/decrement, then the `/` following it starts a DivPunctuator.

Fortunately, you can determine whether it is a "pre-" operator by checking its previous token. First, post-increment/decrement is a restricted production, so if `++` or `--` is preceded by a linebreak, then you know it is "pre-". Otherwise, if the previous token is any of the things that can precede a regular expression literal (yay recursion!), then you know it is "pre-". In all other cases, it is "post-".

Of course, the `)` punctuator doesn't always indicate the end of an expression, for example `if (something) /regex/.exec(x)`. This is tricky because it does require some semantic understanding to disentangle.

Sadly, that's not quite all. There are some operators that are not Punctuators, and other notable keywords to boot. Regular expression literals can also follow these. They are:
If the IdentifierName you just consumed is one of these, then you're looking at a regular expression literal; otherwise, it's a DivPunctuator.
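The previous-token rule described above can be sketched roughly as follows. The token shape (`type`, `value`, `newlineBefore`), the function name and both sets are assumptions made for illustration; the sets are abbreviated and are not the complete lists from the specification. `)`, `]` and `}` are simply treated as ending an expression here, which is wrong for cases such as `if (something) /regex/.exec(x)`.

```javascript
// Hypothetical token shape: { type, value, newlineBefore }
// type is "punctuator", "name" (any IdentifierName, keywords included), "number", ...

// Abbreviated, illustrative sets (assumptions, not the full ES5.1 lists).
const REGEX_AFTER_PUNCTUATOR = new Set([
  "(", ",", "{", "[", ";", "=", "==", "===", "!=", "!==", "<", ">", "<=", ">=",
  "+", "-", "*", "%", "&", "|", "^", "!", "~", "&&", "||", "?", ":",
  "+=", "-=", "*=", "/=", "/",
]);
const REGEX_AFTER_KEYWORD = new Set([
  "return", "typeof", "instanceof", "in", "new", "delete", "void",
  "throw", "case", "do", "else",
]);

// tokens: everything lexed so far. Returns true if a "/" that follows
// tokens[i] should be lexed as the start of a regular expression literal.
function slashStartsRegex(tokens, i = tokens.length - 1) {
  if (i < 0) return true; // start of input: no expression precedes the slash
  const prev = tokens[i];
  if (prev.type === "name") return REGEX_AFTER_KEYWORD.has(prev.value);
  if (prev.type !== "punctuator") return false; // a literal ends an expression
  if (prev.value === "++" || prev.value === "--") {
    // Postfix ++/-- is a restricted production: a line break before it means prefix.
    if (prev.newlineBefore) return true;
    // Otherwise it is prefix exactly when a regex could have started
    // where the ++/-- appears (the recursion).
    return slashStartsRegex(tokens, i - 1);
  }
  return REGEX_AFTER_PUNCTUATOR.has(prev.value);
}
```

With this sketch, a slash after `x =` is classified as a regex, a slash after `x ++` as a division, and a slash after `= ++` as a regex.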
The above is based on the ECMAScript 5.1 specification (as found here) and does not include any browser-specific extensions to the language. But if you need to support those, then this should provide easy guidelines for determining which sort of context you're in.
Of course, most of the above represent very silly cases for including a regular expression literal. For example, you can't actually pre-increment a regular expression, even though it is syntactically allowed. So most tools can get away with simplifying the regular expression context checking for real-world applications. JSLint's method of checking the preceding character for `(,=:[!&|?{};` is probably sufficient. But if you take such a shortcut when developing what's supposed to be a tool for lexing JS, then you should make sure to note that.
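A minimal sketch of that shortcut, assuming the only inputs are the source text and the index of the slash. The function name, the whitespace skipping and the start-of-input behaviour are illustrative choices, not JSLint's actual code.

```javascript
// Characters after which a "/" is assumed to start a regular expression literal
// (the set quoted above).
const REGEX_PRECEDERS = "(,=:[!&|?{};";

// Hypothetical helper: inspects the last non-whitespace character before
// source[slashIndex] and guesses whether the slash starts a regex.
function slashLooksLikeRegex(source, slashIndex) {
  let i = slashIndex - 1;
  while (i >= 0 && /\s/.test(source[i])) i--;
  if (i < 0) return true; // assumption: nothing precedes the slash, treat it as a regex
  return REGEX_PRECEDERS.includes(source[i]);
}
```

Because it only looks at one character, this sketch classifies the slash in `return /a/` as a division, which is exactly the kind of case the shortcut gives up on.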
I am currently developing a JavaScript/ECMAScript 5.1 parser with JavaCC. `RegularExpressionLiteral` and Automatic Semicolon Insertion are two things that drive me crazy in the ECMAScript grammar. This question and its answers were invaluable for the regex problem. In this answer I'd like to put my own findings together.
TL;DR In JavaCC, use lexical states and switch them from the parser.
Very important is what Thom Blake wrote:
So you actually need to know whether what came before was an expression. This is trivial in the parser but very hard in the lexer.

As Thom pointed out, in many (but, unfortunately, not all) cases you can tell whether it was an expression by "looking" at the last token. You have to consider punctuators as well as keywords.
Let's start with keywords. The following keywords cannot precede a `DivPunctuator` (for example, you cannot have `case /5`), so if you see a `/` after these, you have a `RegularExpressionLiteral`:

Next, punctuators. The following punctuators cannot precede a `DivPunctuator` (e.g. in `{ /a...` the symbol `/` can never start a division):

So if you have one of these and see `/...` after it, then this can never be a `DivPunctuator` and therefore must be a `RegularExpressionLiteral`.

Next, if you have:

And `/...` after that, it also must be a `RegularExpressionLiteral`. If there were no space between these slashes (i.e. `// ...`), this would have been handled as a `SingleLineComment` ("maximal munch").

Next, the following punctuator may only end an expression:

So the following `/` must start a `DivPunctuator`.

Now we have the following remaining cases which are, unfortunately, ambiguous:

For `}` and `)` you have to know whether they end an expression or not; `++` and `--` either end a `PostfixExpression` or start a `UnaryExpression`.

And I have come to the conclusion that it is very hard (if not impossible) to find out in the lexer. To give you a sense of that, a couple of examples.
In this example:

`/a/g` is a `RegularExpressionLiteral`, but in this one:

`/a/g` is a division.

In case of `)` you can have a division:

as well as a `RegularExpressionLiteral`:

So, unfortunately, it looks like you can't solve it with the lexer alone. Or you'll have to bring so much grammar into the lexer that it's no longer a lexer.
This is a problem.
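To make the ambiguity concrete, here are four small runnable illustrations. They are assumptions chosen to show each case, not necessarily the snippets this answer originally contained, and the function and variable names are hypothetical. In each function the characters `/a/g` appear right after a `}` or a `)`.

```javascript
// `}` closes a block statement, so `/a/g` is a regular expression literal.
function afterBlock() {
  let matched = false;
  if (true) {}
  /a/g.test("banana") && (matched = true);
  return matched;
}

// `}` closes an object literal (an expression), so `/a/g` is two divisions:
// the initializer is parsed as ({} / a) / g.
function afterObjectLiteral() {
  const a = 4, g = 2;
  const x = {}
  /a/g;
  return x;
}

// `)` closes a parenthesized expression, so `/a/g` is two divisions.
function afterParenthesizedExpression() {
  const a = 4, g = 2;
  return (16) /a/g;
}

// `)` closes the condition of an `if`, so `/a/g` is a regular expression literal.
function afterIfCondition() {
  let matched = false;
  if (true) /a/g.test("banana") && (matched = true);
  return matched;
}
```

The tokens immediately before the slash are identical within each pair, so a lexer that only looks at the previous token cannot tell the two readings apart.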
Now, a possible solution, which in my case is JavaCC-based.

I am not sure if other parser generators have similar features, but JavaCC has a lexical states feature which can be used to switch between "we expect a `DivPunctuator`" and "we expect a `RegularExpressionLiteral`" states. For instance, in this grammar the `NOREGEXP` state means "we don't expect a `RegularExpressionLiteral` here".

This solves part of the problem, but not the ambiguous `)`, `}`, `++` and `--`.

For this, you'll need to be able to switch lexical states from the parser. This is possible, see the following question in the JavaCC FAQ:
A lookahead parser may have already gone too far in the token stream (i.e. already read `/` as a `DIV` or vice versa).

Fortunately there seems to be a way to make switching lexical states a bit safer:
The idea is to make a "backup" token stream and push tokens read during lookahead back again.
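The state-switching idea is not specific to JavaCC. As a rough sketch, assume a hand-written lexer that produces one token per call and is told by the parser whether a regular expression literal may start at this point, mirroring the InputElementDiv / InputElementRegExp split. The class and method names are hypothetical, the regex body pattern is simplified (no character classes containing `/`), and comments are not handled.

```javascript
// Minimal on-demand lexer: the caller (the parser) decides the lexical state.
class Lexer {
  constructor(source) {
    this.source = source;
    this.pos = 0;
  }

  // regexAllowed: true  -> a leading "/" starts a regular expression literal
  //               false -> a leading "/" is the "/" or "/=" punctuator
  next(regexAllowed) {
    const s = this.source;
    while (this.pos < s.length && /\s/.test(s[this.pos])) this.pos++;
    if (this.pos >= s.length) return { type: "eof", value: "" };
    const rest = s.slice(this.pos);
    let m;
    if (rest[0] === "/") {
      m = regexAllowed
        ? /^\/(?:\\.|[^\/\\\n])+\/[A-Za-z]*/.exec(rest)
        : /^\/=?/.exec(rest);
      if (!m) throw new SyntaxError("unterminated regular expression literal");
      return this.emit(regexAllowed ? "regex" : "punctuator", m[0]);
    }
    if ((m = /^[A-Za-z_$][\w$]*/.exec(rest))) return this.emit("name", m[0]);
    if ((m = /^\d+/.exec(rest))) return this.emit("number", m[0]);
    return this.emit("punctuator", rest[0]);
  }

  emit(type, value) {
    this.pos += value.length;
    return { type, value };
  }
}

// The same characters lexed under the two states:
const asRegex = new Lexer("/a/g").next(true);     // regex token "/a/g"
const asDivision = new Lexer("/a/g").next(false); // punctuator token "/"
```

In this arrangement the lexer never reads ahead of the parser, so there is nothing to push back. A parser that needs more than one token of lookahead would still face the problem described above.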
I think that this should work for `}`, `)`, `++`, `--` as they are normally found in LOOKAHEAD(1) situations, but I am not 100% sure of that. In the worst case the lexer may have already tried to parse a `/`-starting token as a `RegularExpressionLiteral` and failed because it was not terminated by another `/`.

In any case, I see no better way of doing that. The next best thing would probably be to drop the case altogether (like JSLint and many others did), document it and just not parse these types of expressions. `{}/a/g` does not make much sense anyway.
JSLint appears to expect a regular expression if the preceding token is one of
Rhino always returns a DIV (slash) token from the lexer.
You can only know how to interpret the / by also implementing a syntax parser. Whichever lex path arrives at a valid parse determines how to interpret the character. Apparently, this is something they had considered fixing, but didn't.
More reading here:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
See section 7:
I agree, it's confusing and there should be one top-level grammar expression rather than two.
edit:
Maybe the simple answer is staring us in the face: try one and then try the other. Since they are not both permitted, at most one will yield an error-free match.