Which Javascript constructs does JsLex lex incorrectly?

Posted 2024-10-29 16:27:30

JsLex is a Javascript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways that's important for lexing. I just don't know what they are.

What Javascript code does JsLex lex incorrectly? I'm especially interested in valid Javascript source where JsLex incorrectly identifies regex literals.

Just to be clear, by "lexing" I mean identifying tokens in a source file. JsLex makes no attempt to parse Javascript, much less execute it. I've written JsLex to do full lexing, though to be honest I would be happy if it merely was able to successfully find all the regex literals.

Comments (5)

要走就滚别墨迹 2024-11-05 16:27:30

Interestingly enough, I tried your lexer on the code of my own lexer/evaluator written in JS ;) You're right, it does not always do well with regular expressions. Here are some examples:

rexl.re = {
  NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
  UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
  QUOTED_LITERAL: /^'(?:[^']|'')*'/,
  NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
  SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};

This one is mostly fine: only UNQUOTED_LITERAL is not recognized. But now let's make a minor addition to it:

rexl.re = {
  NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
  UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
  QUOTED_LITERAL: /^'(?:[^']|'')*'/,
  NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
  SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};
str = '"';

Now everything after NAME's regexp gets messed up: it is lexed as one big string. I think the latter problem is that the string token is too greedy. The former may be that the regexp for the regex token is too clever.

Edit: I think I've fixed the regexp for the regex token. In your code replace lines 146-153 (the whole 'following characters' part) with the following expression:

([^/]|(?<!\\)(?<=\\)/)*

The idea is to allow everything except /, also allow \/, but not allow \\/.
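
One way to express the same idea without lookbehind, as a sketch rather than the actual JsLex source: consume each backslash together with the character it escapes, so \/ stays inside the literal while the / after \\ closes it. The names REGEX_LITERAL and match_regex_literal are made up for illustration, and the pattern assumes comments have already been handled and that no unescaped / appears inside a character class.

```python
import re

# Body of a regex literal: either an ordinary character (anything except
# '/', a backslash or a newline) or a backslash followed by one character.
# Trailing letters are taken as flags.
REGEX_LITERAL = re.compile(r"/(?:[^/\\\n]|\\[^\n])+/[A-Za-z]*")

def match_regex_literal(source, pos=0):
    """Return the regex literal starting at pos, or None if there is none."""
    m = REGEX_LITERAL.match(source, pos)
    return m.group(0) if m else None
```

With this pattern, /\\/g ends at the first unescaped slash, and /a\/b/i is kept whole.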

Edit: Another interesting case. It passes after the fix, but might be worth adding as a built-in test case:

    case 'UNQUOTED_LITERAL': 
    case 'QUOTED_LITERAL': {
        this._js =  "e.str(\"" + this.value.replace(/\\/g, "\\\\").replace(/"/g, "\\\"") + "\")";
        break;
    }

Edit: Yet another case. It appears to be too greedy about keywords as well. See the case:

var clazz = function() {
    if (clazz.__) return delete(clazz.__);
    this.constructor = clazz;
    if(constructor)
        constructor.apply(this, arguments);
};

It lexes it as: (keyword, const), (id, ructor). The same happens for an identifier inherits: in and herits.
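
A common way to avoid this kind of split, sketched here under the assumption that keywords are matched by prefix somewhere in the lexer: match the whole identifier-like word first, and only then check whether it is a keyword. KEYWORDS is a deliberately short, illustrative list, and lex_word is a hypothetical helper, not part of JsLex.

```python
import re

# Illustrative subset of keywords, not the full reserved word list.
KEYWORDS = {"const", "delete", "function", "if", "in", "return", "this", "var"}

# ASCII-only identifier pattern, enough for the examples above.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def lex_word(source, pos=0):
    """Match the longest word at pos, then classify it as keyword or id."""
    m = IDENTIFIER.match(source, pos)
    if m is None:
        return None
    word = m.group(0)
    kind = "keyword" if word in KEYWORDS else "id"
    return (kind, word)
```

Because the full word is consumed before classification, constructor and inherits come out as single identifiers, while const and in on their own are still keywords.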

聽兲甴掵 2024-11-05 16:27:30

Example: The first occurrence of / 2 /i below (the assignment to a) should tokenize as Div, NumericLiteral, Div, Identifier, because it is in an InputElementDiv context. The second occurrence (the assignment to b) should tokenize as RegularExpressionLiteral, because it is in an InputElementRegExp context.

i = 1;
var a = 1 / 2 /i;
console.info(a); // ⇒ 0.5
console.info(typeof a); // ⇒ number

var b = 1 + / 2 /i;
console.info(b); // ⇒ 1/ 2 /i
console.info(typeof b); // ⇒ string

Source:

There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.

Note that contexts exist in the syntactic grammar where both a division and a RegularExpressionLiteral are permitted by the syntactic grammar; however, since the lexical grammar uses the InputElementDiv goal symbol in such cases, the opening slash is not recognised as starting a regular expression literal in such a context. As a workaround, one may enclose the regular expression literal in parentheses.
Standard ECMA-262 3rd Edition - December 1999, p. 11
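
A lexer that does not parse can approximate the two goal symbols by remembering what kind of token came last. The following is a minimal sketch under simplifying assumptions: the token set is tiny, keywords are lumped in with identifiers (so it would get return /x/ wrong), and a slash is treated as division only after a number, an identifier, a regex literal or a closing parenthesis. All names are illustrative.

```python
import re

WS = re.compile(r"\s*")
SIMPLE = re.compile(
    r"(?P<NumericLiteral>\d+(?:\.\d+)?)"
    r"|(?P<Identifier>[A-Za-z_$][\w$]*)"
    r"|(?P<Punctuator>[-+*=(),;])"
)
REGEX = re.compile(r"/(?:[^/\\\n]|\\[^\n])+/[A-Za-z]*")
DIV = re.compile(r"/=?")

def tokenize(source):
    """Return (kind, text) pairs, picking Div or regex by the previous token."""
    tokens, pos = [], 0
    div_context = False  # start of input: a slash would begin a regex
    while True:
        pos = WS.match(source, pos).end()
        if pos == len(source):
            return tokens
        if source[pos] == "/":
            m = (DIV if div_context else REGEX).match(source, pos)
            kind = "Div" if div_context else "RegularExpressionLiteral"
        else:
            m = SIMPLE.match(source, pos)
            kind = m.lastgroup if m else None
        if m is None:
            raise ValueError(f"cannot lex at offset {pos}")
        tokens.append((kind, m.group(0)))
        # After an operand, the next slash can only be a division operator.
        div_context = (
            kind in ("NumericLiteral", "Identifier", "RegularExpressionLiteral")
            or m.group(0) == ")"
        )
        pos = m.end()
```

On the two assignments above, this yields Div, NumericLiteral, Div, Identifier for the first and a single RegularExpressionLiteral for the second.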

傻比既视感 2024-11-05 16:27:30

The simplicity of your solution for handling this hairy problem is very cool, but I noticed that it doesn't quite handle a change to the something.property syntax in ES5, which allows reserved words after the dot. For example, a.if = 'foo'; (function () {a.if /= 3;}); is valid in some recent implementations.

Unless I'm mistaken, . is only used for property access anyway, so adding an additional state after the . that accepts only the identifierName token (which is what identifier uses, except that it doesn't reject reserved words) would probably do the trick. (Obviously the div state follows that as usual.)
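
Roughly, the extra state could look like the sketch below. The state names "reg", "div" and "property" and the helper lex_member_chain are assumptions for illustration, not the names used in JsLex, and the scanner only understands words and dots.

```python
import re

# Illustrative subset of reserved words.
RESERVED = {"if", "return", "typeof", "delete", "in"}

TOKEN = re.compile(r"\s*(?:(?P<word>[A-Za-z_$][\w$]*)|(?P<dot>\.))")

def lex_member_chain(source):
    """Scan words and dots, returning (kind, text, next_state) triples.

    next_state is "div" when a following slash would be division,
    "reg" when it would start a regex, and "property" right after a dot.
    """
    out, pos, after_dot = [], 0, False
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            break  # stop at anything this small sketch does not cover
        if m.lastgroup == "dot":
            out.append(("punct", ".", "property"))
            after_dot = True
        else:
            word = m.group("word")
            if after_dot or word not in RESERVED:
                # After a dot any IdentifierName is a property name.
                out.append(("id", word, "div"))
            else:
                out.append(("keyword", word, "reg"))
            after_dot = False
        pos = m.end()
    return out
```

Here a.if ends in the div state, so the /= that follows is read as division-assignment, while a bare if still leaves the lexer expecting a regex.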

小…红帽 2024-11-05 16:27:30

I've been thinking about the problems of writing a lexer for JavaScript myself, and I just came across your implementation in my search for good techniques. I found a case where yours doesn't work that I thought I'd share if you're still interested:

var g = 3, x = { valueOf: function() { return 6;} } /2/g;

The slashes should both be parsed as division operators, resulting in x being assigned the numeric value 1. Your lexer thinks that it is a regexp. There is no way to handle all variants of this case correctly without maintaining a stack of grouping contexts to distinguish among the end of a block (expect regexp), the end of a function statement (expect regexp), the end of a function expression (expect division), and the end of an object literal (expect division).
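
To make the stack idea concrete, here is a small sketch that works on an already split list of tokens and only classifies closing braces. It guesses whether a { or function sits in expression position from the previous token alone, which is a simplification (labels and case clauses would fool it); OPERAND_EXPECTED_AFTER and classify_closing_braces are hypothetical names.

```python
# Tokens after which an operand (and so an expression) is expected.
OPERAND_EXPECTED_AFTER = {
    "=", "(", ",", ":", "[", "?", "+", "-", "*", "/", "!", "return", "typeof",
}

# What a slash means directly after the closing brace of each context.
SLASH_AFTER = {
    "block": "regex",
    "function statement": "regex",
    "function expression": "div",
    "object literal": "div",
}

def classify_closing_braces(tokens):
    """Return (context, slash_meaning) for every '}' in tokens, in order."""
    stack, results = [], []
    pending_function = None  # kind of the function whose body is not open yet
    prev = None
    for tok in tokens:
        if tok == "function":
            in_expr = prev in OPERAND_EXPECTED_AFTER
            pending_function = (
                "function expression" if in_expr else "function statement"
            )
        elif tok == "{":
            if pending_function is not None:
                stack.append(pending_function)
                pending_function = None
            elif prev in OPERAND_EXPECTED_AFTER:
                stack.append("object literal")
            else:
                stack.append("block")
        elif tok == "}":
            kind = stack.pop()
            results.append((kind, SLASH_AFTER[kind]))
        prev = tok
    return results
```

For the statement above, both closing braces come out as contexts that expect division, which is why /2/g must be two divisions.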

依 靠 2024-11-05 16:27:30

Does it work properly for this code (this shouldn't have a semicolon; it produces an error when lexed properly)?

function square(num) {
    var result;
    var f = function (x) {
        return x * x;
    }
    (result = f(num));
    return result;
}

If it does, does it work properly for this code, that relies on semicolon insertion?

function square(num) {
    var f = function (x) {
        return x * x;
    }
    return f(num);
}