When parsing Javascript, what determines the meaning of a slash?

Posted on 2024-10-28 19:28:55

Javascript has a tricky grammar to parse. Forward-slashes can mean a number of different things: division operator, regular expression literal, comment introducer, or line-comment introducer. The last two are easy to distinguish: if the slash is followed by a star, it starts a multiline comment. If the slash is followed by another slash, it is a line-comment.

But the rules for disambiguating division and regex literals are escaping me. I can't find them in the ECMAScript standard. There the lexical grammar is explicitly divided into two parts, InputElementDiv and InputElementRegExp, depending on what a slash will mean. But there's nothing explaining when to use which.

And of course the dreaded semicolon insertion rules complicate everything.

Does anyone have an example of clear code for lexing Javascript that has the answer?

Comments (5)

云仙小弟 2024-11-04 19:28:55

It's actually fairly easy, but it requires making your lexer a little smarter than usual.

The division operator must follow an expression, and a regular expression literal can't follow an expression, so in all other cases you can safely assume you're looking at a regular expression literal.

You already have to identify Punctuators as multiple-character strings, if you're doing it right. So look at the previous token, and see if it's any of these:

. ( , { } [ ; < > <= >= == != === !== + - * % ++ --
<< >> >>> & | ^ ! ~ && || ? : = += -= *= %= <<= >>= >>>=
&= |= ^= / /=

For most of these, you now know you're in a context where you can find a regular expression literal. Now, in the case of ++ --, you'll need to do some extra work. If the ++ or -- is a pre-increment/decrement, then the / following it starts a regular expression literal; if it is a post-increment/decrement, then the / following it starts a DivPunctuator.

Fortunately, you can determine whether it is a "pre-" operator by checking its previous token. First, post-increment/decrement is a restricted production, so if ++ or -- is preceded by a linebreak, then you know it is "pre-". Otherwise, if the previous token is any of the things that can precede a regular expression literal (yay recursion!), then you know it is "pre-". In all other cases, it is "post-".

Of course, the ) punctuator doesn't always indicate the end of an expression - for example if (something) /regex/.exec(x). This is tricky because it does require some semantic understanding to disentangle.

Sadly, that's not quite all. There are some operators that are not Punctuators, and other notable keywords to boot. Regular expression literals can also follow these. They are:

new delete void typeof instanceof in do return case throw else

If the IdentifierName you just consumed is one of these, then you're looking at a regular expression literal; otherwise, it's a DivPunctuator.
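The previous-token rule described so far can be sketched as a small predicate. This is only an illustration under assumptions: the token shape `{ text, newlineBefore }` and the name `regexAllowedAfter` are hypothetical, the sets simply copy the two lists above (including `}`, which is not always safe), and the `)` problem mentioned earlier is not addressed.

```javascript
// Punctuators after which a "/" is taken to start a regular expression literal.
// "++" and "--" are handled separately below.
const REGEX_PUNCTUATORS = new Set(
  (". ( , { } [ ; < > <= >= == != === !== + - * % " +
   "<< >> >>> & | ^ ! ~ && || ? : = += -= *= %= <<= >>= >>>= " +
   "&= |= ^= / /=").split(" ")
);

// Keywords after which a "/" is taken to start a regular expression literal.
const REGEX_KEYWORDS = new Set(
  "new delete void typeof instanceof in do return case throw else".split(" ")
);

// tokens: array of { text: string, newlineBefore: boolean } (hypothetical shape).
// Returns true when a "/" that follows tokens[index] should be lexed as a regex.
function regexAllowedAfter(tokens, index) {
  if (index < 0) return true; // start of input: nothing to divide
  const tok = tokens[index];
  if (tok.text === "++" || tok.text === "--") {
    // Prefix operator if a line break precedes it, or if the token before it
    // is itself something a regex could follow (the recursive step).
    return tok.newlineBefore || regexAllowedAfter(tokens, index - 1);
  }
  return REGEX_PUNCTUATORS.has(tok.text) || REGEX_KEYWORDS.has(tok.text);
}

// Small helper for trying it out: tokens with no preceding line breaks.
const toks = (...texts) => texts.map((text) => ({ text, newlineBefore: false }));

console.log(regexAllowedAfter(toks("a", "="), 1));  // true: regex context
console.log(regexAllowedAfter(toks("a", "++"), 1)); // false: post-increment, so division
```

Any token not in either set (identifiers, literals, `]`, `)`) falls through to division here, which is exactly where the `if (something) /regex/` case goes wrong.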

The above is based on the ECMAScript 5.1 specification (as found here) and does not include any browser-specific extensions to the language. But if you need to support those, then this should provide easy guidelines for determining which sort of context you're in.

Of course, most of the above represent very silly cases for including a regular expression literal. For example, you can't actually pre-increment a regular expression, even though it is syntactically allowed. So most tools can get away with simplifying the regular expression context checking for real-world applications. JSLint's method of checking the preceding character for (,=:[!&|?{}; is probably sufficient. But if you take such a shortcut when developing what's supposed to be a tool for lexing JS, then you should make sure to note that.

手长情犹 2024-11-04 19:28:55

I am currently developing a JavaScript/ECMAScript 5.1 parser with JavaCC. RegularExpressionLiteral and Automatic Semicolon Insertion are the two things that drive me crazy in the ECMAScript grammar. This question and its answers were invaluable for the regex problem. In this answer I'd like to put my own findings together.

TL;DR In JavaCC, use lexical states and switch them from the parser.


Very important is what Thom Blake wrote:

The division operator must follow an expression, and a regular
expression literal can't follow an expression, so in all other cases
you can safely assume you're looking at a regular expression literal.

So you actually need to know whether what came before the slash was an expression. This is trivial in the parser but very hard in the lexer.

As Thom pointed out, in many (but, unfortunately, not all) cases you can understand if it was an expression by "looking" at the last token. You have to consider punctuators as well as keywords.

Let's start with keywords. The following keywords cannot precede a DivPunctuator (for example, you cannot have case /5), so if you see a / after these, you have a RegularExpressionLiteral:

case
delete
do
else
in
instanceof
new
return
throw
typeof
void

Next, punctuators. The following punctuators cannot precede a DivPunctuator (e.g. in { /a... the symbol / can never start a division):

{       (       [   
.   ;   ,   <   >   <=
>=  ==  !=  === !== 
+   -   *   %       
<<  >>  >>> &   |   ^
!   ~   &&  ||  ?   :
=   +=  -=  *=  %=  <<=
>>= >>>=    &=  |=  ^=
    /=

So if you have one of these and see /... after this, then this can never be a DivPunctuator and therefore must be a RegularExpressionLiteral.

Next, if you have:

/

And /... after that, it must also be a RegularExpressionLiteral. If there were no space between these slashes (i.e. // ...), the input would have been handled as a SingleLineComment ("maximal munch").

Next, the following punctuator may only end an expression:

]

So the following / must start a DivPunctuator.

Now we have the following remaining cases which are, unfortunately, ambiguous:

}
)
++
--

For } and ) you have to know whether they end an expression; for ++ and -- you have to know whether they end a PostfixExpression or start a UnaryExpression.
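The classification built up so far (regex for certain keywords and punctuators, division after ], and four undecidable tokens) could be sketched as a three-way function. The names `classifySlashAfter`, `KEYWORDS_BEFORE_REGEXP` and the return strings are hypothetical; the sets are copied from the lists above, and everything not listed (identifiers, literals, ]) is assumed to end an expression.

```javascript
// Keywords that cannot precede a DivPunctuator (from the list above).
const KEYWORDS_BEFORE_REGEXP = new Set([
  "case", "delete", "do", "else", "in", "instanceof",
  "new", "return", "throw", "typeof", "void",
]);

// Punctuators that cannot precede a DivPunctuator, plus "/" itself.
const PUNCTUATORS_BEFORE_REGEXP = new Set([
  "{", "(", "[", ".", ";", ",", "<", ">", "<=", ">=", "==", "!=", "===", "!==",
  "+", "-", "*", "%", "<<", ">>", ">>>", "&", "|", "^", "!", "~", "&&", "||",
  "?", ":", "=", "+=", "-=", "*=", "%=", "<<=", ">>=", ">>>=", "&=", "|=", "^=",
  "/=", "/",
]);

// Tokens for which the previous token alone does not decide the question.
const AMBIGUOUS_BEFORE_SLASH = new Set(["}", ")", "++", "--"]);

// prevText: text of the previous token, or null at the start of input.
// Returns "regexp", "div", or "ambiguous".
function classifySlashAfter(prevText) {
  if (prevText === null) return "regexp";
  if (KEYWORDS_BEFORE_REGEXP.has(prevText)) return "regexp";
  if (PUNCTUATORS_BEFORE_REGEXP.has(prevText)) return "regexp";
  if (AMBIGUOUS_BEFORE_SLASH.has(prevText)) return "ambiguous";
  return "div"; // "]", identifiers, literals: assumed to end an expression
}

console.log(classifySlashAfter("return")); // "regexp"
console.log(classifySlashAfter("]"));      // "div"
console.log(classifySlashAfter(")"));      // "ambiguous"
```

A lexer built on this would still need outside help whenever the answer is "ambiguous", which is the point the rest of this answer develops.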

And I have come to the conclusion that it is very hard (if not impossible) to find out in the lexer. To give you a sense of that, a couple of examples.

In this example:

{}/a/g

/a/g is a RegularExpressionLiteral, but in this one:

+{}/a/g

/a/g is a division.

In case of ) you can have a division:

('a')/a/g

as well as a RegularExpressionLiteral:

if ('a')/a/g

So, unfortunately, it looks like you can't solve it with the lexer alone. Otherwise you would have to bring so much of the grammar into the lexer that it is no longer a lexer.
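The four inputs above can be checked against a real engine. The sketch below uses direct `eval`, which returns the completion value of the last statement, and assumes hypothetical numeric bindings for `a` and `g` so that the division readings can be evaluated at all. A RegExp result means the slash was read as a regular expression literal; NaN means it was read as division.

```javascript
// Hypothetical bindings: only needed so that "x / a / g" can be evaluated.
const a = 2, g = 4;

const results = {
  blockThenRegex: eval("{}/a/g"),        // empty block, then the statement /a/g
  unaryPlusObject: eval("+{}/a/g"),      // (+{}) / a / g
  parenthesized: eval("('a')/a/g"),      // 'a' / a / g
  ifStatement: eval("if ('a')/a/g"),     // if (...) followed by the statement /a/g
};

for (const [name, value] of Object.entries(results)) {
  console.log(name, value instanceof RegExp ? "RegularExpressionLiteral" : "division");
}
```

The same trailing text `/a/g` is lexed differently in each pair, and the only difference lies in what the parser knows about the preceding `}` or `)`.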

This is a problem.


Now for a possible solution, which in my case is JavaCC-based.

I am not sure if you have similar features in other parser generators, but JavaCC has a lexical states feature which can be used to switch between "we expect a DivPunctuator" and "we expect a RegularExpressionLiteral" states. For instance, in this grammar the NOREGEXP state means "we don't expect a RegularExpressionLiteral here".

This solves part of the problem, but not the ambiguous ), }, ++ and --.

For this, you'll need to be able to switch lexical states from the parser. This is possible, see the following question in JavaCC FAQ:

Can the parser force a switch to a new lexical state?

Yes, but it is very easy to create bugs by doing so.

A lookahead parser may have already gone too far in the token stream (i.e. already read / as a DIV or vice versa).

Fortunately there seems to be a way to make switching lexical states a bit safer:

Is there a way to make SwitchTo safer?

The idea is to make a "backup" token stream and push tokens read during lookahead back again.

I think that this should work for }, ), ++ and --, as they are normally found in LOOKAHEAD(1) situations, but I am not 100% sure of that. In the worst case the lexer may have already tried to parse a token starting with / as a RegularExpressionLiteral and failed because it was not terminated by another /.

In any case, I see no better way of doing that. The next best thing would probably be to drop these cases altogether (as JSLint and many others did), document that, and simply not parse these types of expressions. {}/a/g does not make much sense anyway.
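Outside JavaCC, the same idea can be approximated by letting the parser tell the lexer which lexical goal applies each time it asks for a token. The sketch below is not JavaCC and not the grammar referred to above; `GoalLexer`, the goal names "regexp" and "div", and the token types are all hypothetical, and comments, string literals and multi-character punctuators are deliberately left out.

```javascript
// A minimal lexer whose caller (the parser) chooses the lexical goal per token.
class GoalLexer {
  constructor(source) {
    this.source = source;
    this.pos = 0;
  }

  // goal: "regexp" (like InputElementRegExp) or "div" (like InputElementDiv).
  // Returns { type, text }, or null at the end of input.
  next(goal) {
    const src = this.source;
    while (this.pos < src.length && /\s/.test(src[this.pos])) this.pos++;
    if (this.pos >= src.length) return null;

    const rest = src.slice(this.pos);
    let m;
    if (rest[0] === "/") {
      if (goal === "regexp") {
        // Simplified regex-literal pattern: escapes, character classes, flags.
        m = /^\/(?:\\.|\[(?:\\.|[^\]\\\n])*\]|[^\/\\\[\n])+\/[A-Za-z]*/.exec(rest);
        if (!m) throw new SyntaxError("unterminated regular expression literal");
        return this.emit("RegExp", m[0]);
      }
      m = /^\/=?/.exec(rest); // "/" or "/="
      return this.emit("Div", m[0]);
    }
    m = /^[A-Za-z_$][\w$]*/.exec(rest);
    if (m) return this.emit("Name", m[0]);
    m = /^\d+/.exec(rest);
    if (m) return this.emit("Number", m[0]);
    return this.emit("Punct", rest[0]); // single-character punctuators only
  }

  emit(type, text) {
    this.pos += text.length;
    return { type, text };
  }
}

// The same characters yield different tokens depending on the requested goal.
console.log(new GoalLexer("/a/g").next("regexp")); // { type: 'RegExp', text: '/a/g' }
console.log(new GoalLexer("/a/g").next("div"));    // { type: 'Div', text: '/' }
```

Because the goal is passed with every request, no token is read before the parser has decided, which sidesteps the lookahead hazard described in the FAQ at the cost of ruling out any lexer-side buffering of tokens.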

玉环 2024-11-04 19:28:55

JSLint appears to expect a regular expression if the preceding token is one of

(,=:[!&|?{};

Rhino always returns a DIV (slash) token from the lexer.
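As a rough illustration of the character-set shortcut attributed to JSLint here, one could look at the last non-whitespace character before the slash. This is not JSLint's actual code; the function name `slashStartsRegex` and the treatment of the start of input are assumptions made for this sketch.

```javascript
// Characters after which the shortcut assumes a regular expression follows.
const REGEX_PRECEDERS = "(,=:[!&|?{};";

// source: the program text; slashIndex: index of the "/" in question.
// Returns true if the shortcut would treat that "/" as starting a regex.
function slashStartsRegex(source, slashIndex) {
  let i = slashIndex - 1;
  while (i >= 0 && /\s/.test(source[i])) i--; // skip whitespace backwards
  if (i < 0) return true; // assumption: start of input counts as regex context
  return REGEX_PRECEDERS.includes(source[i]);
}

console.log(slashStartsRegex("x = /ab/", 4));   // true
console.log(slashStartsRegex("a / b", 2));      // false
console.log(slashStartsRegex("return /x/", 7)); // false, although this is a regex
```

The last call shows the cost of the shortcut: keywords such as return are not covered by a character check, so a tool using it rejects or misreads some valid programs.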

堇色安年 2024-11-04 19:28:55

You can only know how to interpret the / by also implementing a syntax parser. Whichever lex path arrives at a valid parse determines how to interpret the character. Apparently, this is something they had considered fixing, but didn't.
More reading here:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions

喜爱皱眉﹌ 2024-11-04 19:28:55

See section 7:

There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a leading division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.

NOTE There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 7.9); in examples such as the following:

a = b 
/hi/g.exec(c).map(d); 

where the first non-whitespace, non-comment character after a LineTerminator is slash (/) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in the same way as:

a = b / hi / g.exec(c).map(d); 
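The quoted note can be checked in an engine by evaluating the two-line form with bindings chosen so that only the division reading succeeds. All values below are hypothetical stand-ins: `g` is a plain object with an `exec` method, so `g.exec(c).map(d)` yields the array `[2]`, which converts to the number 2 when used as a divisor.

```javascript
// Hypothetical bindings for the names used in the specification's example.
const b = 12, hi = 3, c = 2;
const g = { exec: (x) => [x] }; // not a RegExp: just something with exec()
const d = (x) => x;
let a;

// Two-line form from the note; no semicolon is inserted after "a = b".
eval("a = b\n/hi/g.exec(c).map(d);");

console.log(a); // 2, i.e. 12 / 3 / [2]
```

Had a semicolon been inserted after the first line, `a` would be 12 and the second line would call `/hi/g.exec(2)`, which returns null and makes `.map` throw a TypeError.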

I agree, it's confusing and there should be one top-level grammar expression rather than two.


edit:

But there's nothing explaining when to use which.

Maybe the simple answer is staring us in the face: try one and then try the other. Since they are not both permitted, at most one will yield an error-free match.
