Which Javascript constructs does JsLex lex incorrectly?
JsLex is a Javascript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways that's important for lexing. I just don't know what they are.
What Javascript code does JsLex lex incorrectly? I'm especially interested in valid Javascript source where JsLex incorrectly identifies regex literals.
Just to be clear, by "lexing" I mean identifying tokens in a source file. JsLex makes no attempt to parse Javascript, much less execute it. I've written JsLex to do full lexing, though to be honest I would be happy if it merely was able to successfully find all the regex literals.
Comments (5)
Interestingly enough, I tried your lexer on the code of my lexer/evaluator written in JS ;) You're right, it does not always do well with regular expressions. Here are some examples:
This one is mostly fine: only UNQUITED_LITERAL is not recognized; otherwise all is fine. But now let's make a minor addition to it:
Now everything after the NAME's regexp gets messed up: it becomes one big string. I think the latter problem is that the String token is too greedy. The former might be caused by the regexp for the regex token being too smart.
Edit: I think I've fixed the regexp for the regex token. In your code, replace lines 146-153 (the whole 'following characters' part) with the following expression:
The idea is to allow everything except /, also allow \/, but not allow \\/.
Edit: Another interesting case, which passes after the fix but might be worth adding as a built-in test case:
Edit: Yet another case. It appears to be too greedy about keywords as well. See the case:
It lexes it as (keyword, const), (id, ructor). The same happens for the identifier inherits: it is split into in and herits.
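The two fixes discussed in this answer can be sketched in Python, the language JsLex is written in. This is only an illustration under stated assumptions: the expression the answer proposed is not reproduced here, so `REGEX_LITERAL` is a stand-in pattern that follows the stated idea (plain characters, or a backslash that escapes the next character), and `KEYWORDS`, `match_regex_literal`, and `classify_word` are hypothetical names, not part of JsLex. The sketch ignores character classes such as `[/]`.

```python
import re

# Abbreviated, illustrative keyword set (assumption, not JsLex's list).
KEYWORDS = {"const", "in", "if", "function", "return", "var"}

# Stand-in for the regex-literal rule: the body is made of plain
# characters (anything but '/', '\' and newline) or of a backslash
# followed by any character. So "\/" stays inside the body, while in
# "\\/" the pair "\\" is consumed first and the "/" closes the literal.
REGEX_LITERAL = re.compile(r"/(?:[^/\\\n]|\\.)+/[A-Za-z]*")

# Match the whole word first and classify it afterwards, so that a
# keyword can never be split off the front of a longer identifier.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def match_regex_literal(source, pos=0):
    """Return the regex literal starting at pos, or None."""
    m = REGEX_LITERAL.match(source, pos)
    return m.group(0) if m else None


def classify_word(source, pos=0):
    """Return (kind, text) for the word starting at pos, or None."""
    m = IDENTIFIER.match(source, pos)
    if m is None:
        return None
    word = m.group(0)
    return ("keyword" if word in KEYWORDS else "id", word)
```

Matching the full identifier before looking it up is one common way to avoid the const/ructor split; adding a word boundary after each keyword alternative would be another.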
Example: The first occurrence of / 2 /i below (the assignment to a) should tokenize as Div, NumericLiteral, Div, Identifier, because it is in an InputElementDiv context. The second occurrence (the assignment to b) should tokenize as RegularExpressionLiteral, because it is in an InputElementRegExp context.
Source:
The simplicity of your solution for handling this hairy problem is very cool, but I noticed that it doesn't quite handle a change in something.property syntax for ES5, which allows reserved words following a dot (.). I.e., a.if = 'foo'; (function () {a.if /= 3;}); is a valid statement in some recent implementations.
Unless I'm mistaken, there is only one use of . anyway, for properties, so adding an additional state following the . that accepts only the identifierName token (which is what identifier uses, except that it doesn't reject reserved words) would probably do the trick. (Obviously the div state follows that as per usual.)
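The extra state described above can be sketched as a small three-state lexer in Python. This is a minimal illustration, not JsLex's actual code: the state names `reg`, `div`, and `dot`, the `lex` function, the token patterns, and the abbreviated `RESERVED` set are all assumptions made for the example, and only a handful of token types are covered.

```python
import re

# Abbreviated, illustrative subset of the ES5 reserved words (assumption).
RESERVED = {"if", "in", "return", "typeof", "function", "var"}

IDENTIFIER_NAME = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
NUMBER = re.compile(r"\d+")
STRING = re.compile(r"'[^'\n]*'")
REGEX = re.compile(r"/(?:[^/\\\n]|\\.)+/[A-Za-z]*")
PUNCT = re.compile(r"/=|[/.=;(){}]")


def lex(source):
    """Return a list of (kind, text) pairs.

    state is 'reg' (a '/' would start a regex), 'div' (a '/' is a
    division operator) or 'dot' (a property name must follow).
    """
    tokens, state, pos = [], "reg", 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        if state == "dot":
            # After '.', accept any IdentifierName, reserved or not.
            m = IDENTIFIER_NAME.match(source, pos)
            if m is None:
                raise ValueError(f"expected a property name at offset {pos}")
            kind, state = "id", "div"
        elif (m := IDENTIFIER_NAME.match(source, pos)):
            if m.group(0) in RESERVED:
                kind, state = "keyword", "reg"
            else:
                kind, state = "id", "div"
        elif (m := NUMBER.match(source, pos)):
            kind, state = "number", "div"
        elif (m := STRING.match(source, pos)):
            kind, state = "string", "div"
        elif state == "reg" and (m := REGEX.match(source, pos)):
            kind, state = "regex", "div"
        elif (m := PUNCT.match(source, pos)):
            kind, text = "punct", m.group(0)
            if text == ".":
                state = "dot"
            elif text in ")}":
                state = "div"  # simplification: not always right for '}'
            else:
                state = "reg"
        else:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((kind, m.group(0)))
        pos = m.end()
    return tokens
```

With the `dot` state, `lex("a.if / 2 / i")` treats `if` as a property name and both slashes as division. Without it, `if` would be read as a keyword, the lexer would switch to the `reg` state, and `/ 2 /` would be taken for a regex literal.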
I've been thinking about the problems of writing a lexer for JavaScript myself, and I just came across your implementation in my search for good techniques. I found a case where yours doesn't work that I thought I'd share if you're still interested:
The slashes should both be parsed as division operators, resulting in x being assigned the numeric value 1. Your lexer thinks that it is a regexp. There is no way to handle all variants of this case correctly without maintaining a stack of grouping contexts to distinguish among the end of a block (expect regexp), the end of a function statement (expect regexp), the end of a function expression (expect division), and the end of an object literal (expect division).
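A rough Python sketch of such a stack of grouping contexts follows. It is a simplified illustration under several assumptions: the input is already split into token strings, `EXPR_PREFIX` is an abbreviated hypothetical list of tokens after which an expression is expected, `slash_context` is an invented name, and cases outside the four listed above (for example a regex directly after `if (x)`) are not handled.

```python
# Tokens after which an expression is expected (abbreviated assumption).
EXPR_PREFIX = {"=", "(", ",", ":", "[", "return", "+", "-", "*", "/", "!", "?"}


def slash_context(tokens):
    """Return 'div' or 'reg' for a '/' that would follow tokens."""
    stack = []         # one entry per currently open '(' or '{'
    prev = None        # previous token
    pending_fn = None  # 'expr' or 'stmt' after a `function` keyword
    closed = None      # kind of the most recently closed '(' group
    state = "reg"
    for tok in tokens:
        if tok == "function":
            # A function in expression position is a function expression.
            pending_fn = "expr" if prev in EXPR_PREFIX else "stmt"
            state = "reg"
        elif tok == "(":
            stack.append("params_" + pending_fn if pending_fn else "paren")
            pending_fn = None
            state = "reg"
        elif tok == ")":
            closed = stack.pop()
            state = "div"
        elif tok == "{":
            if prev == ")" and closed.startswith("params_"):
                stack.append("fn_" + closed[len("params_"):])
            elif prev in EXPR_PREFIX:
                stack.append("object")
            else:
                stack.append("block")
            state = "reg"
        elif tok == "}":
            kind = stack.pop()
            # Object literals and function expressions are values, so a
            # following '/' divides; blocks and function statements end
            # a statement, so a following '/' starts a regex.
            state = "div" if kind in ("object", "fn_expr") else "reg"
        elif tok in EXPR_PREFIX or tok == ";":
            state = "reg"
        else:
            state = "div"  # identifiers and literals
        prev = tok
    return state
```

For a hypothetical input such as `x = { valueOf : function ( ) { return 6 ; } } / 2 / g`, the tokens before the first slash end with the closing brace of an object literal, so the sketch reports `div`.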
Does it work properly for this code (this shouldn't have a semicolon; it produces an error when lexed properly)?
If it does, does it also work properly for this code, which relies on semicolon insertion?