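Which Javascript constructs does JsLex lex incorrectly?

Posted 2024-10-29 16:27:30

JsLex is a Javascript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it knows nothing about semicolon insertion, and there are probably ways in which that matters for lexing. I just don't know what they are.

What Javascript code does JsLex lex incorrectly? I'm especially interested in valid Javascript source where JsLex misidentifies regex literals.

To be clear, by "lexing" I mean identifying the tokens in a source file. JsLex makes no attempt to parse Javascript, much less execute it. I wrote JsLex to do full lexing, though to be honest I would be happy if it merely managed to find all the regex literals.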

要走就滚别墨迹 2024-11-05 16:27:30

Interestingly enough, I tried your lexer on the code of my own lexer/evaluator, which is written in JS ;) You're right, it does not always do well with regular expressions. Here are some examples:

rexl.re = {
  NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
  UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
  QUOTED_LITERAL: /^'(?:[^']|'')*'/,
  NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
  SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};

This one is mostly fine: only UNQUOTED_LITERAL is not recognized, everything else is lexed correctly. But now let's make a minor addition to it:

rexl.re = {
  NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
  UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
  QUOTED_LITERAL: /^'(?:[^']|'')*'/,
  NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
  SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};
str = '"';

Now everything after the NAME regexp gets messed up: it is lexed as one big string. I think the latter problem is that the String token is too greedy. The former may be a case of the regexp for the regex token being too clever.

Edit: I think I've fixed the regexp for the regex token. In your code replace lines 146-153 (the whole 'following characters' part) with the following expression:

([^/]|(?<!\\\\)(?<=\\)/)*

The idea is to allow everything except /, also allow \/, but not allow \\/.
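Here is a minimal sketch of that idea, written as a JavaScript regex so it can be tried directly. The names regexBody and bodyLength are made up for illustration and are not part of JsLex, and the pattern assumes an engine with lookbehind support. It only covers the rule stated above (a slash is allowed when preceded by one backslash, but not by two), so longer backslash runs and slashes inside character classes are not handled.

```javascript
// Hypothetical pattern for the body of a regex literal (the part after the
// opening slash). It accepts any non-slash character, or a slash that is
// preceded by a backslash but not by two backslashes.
const regexBody = /^(?:[^\/]|(?<!\\\\)(?<=\\)\/)*/;

// Length of the longest prefix of src accepted as a regex body.
function bodyLength(src) {
  return regexBody.exec(src)[0].length;
}

// An escaped slash is kept inside the body; the body ends before the next bare slash.
console.log(bodyLength(String.raw`ab\/cd/rest`)); // 6

// An escaped backslash followed by a slash ends the body before the slash.
console.log(bodyLength(String.raw`ab\\/rest`)); // 4
```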

Edit: Another interesting case. It passes after the fix, but it might be worth adding as a built-in test case:

    case 'UNQUOTED_LITERAL': 
    case 'QUOTED_LITERAL': {
        this._js =  "e.str(\"" + this.value.replace(/\\/g, "\\\\").replace(/"/g, "\\\"") + "\")";
        break;
    }

Edit: Yet another case. It appears to be too greedy about keywords as well. See the case:

var clazz = function() {
    if (clazz.__) return delete(clazz.__);
    this.constructor = clazz;
    if(constructor)
        constructor.apply(this, arguments);
};

It lexes constructor as two tokens: (keyword, const), (id, ructor). The same happens for the identifier inherits, which is split into in and herits.
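One common way to avoid this kind of split is to require that a keyword match is not followed by another identifier character. The sketch below is only an illustration in JavaScript: the keyword list is a small made-up subset, and greedyKeyword, boundedKeyword and firstToken are hypothetical names, not JsLex internals.

```javascript
// Illustrative subset of keywords; a real lexer would list all of them.
const KEYWORDS = ["const", "delete", "function", "if", "in", "return", "this", "var"];
const alternation = KEYWORDS.join("|");

// Matches a keyword prefix even inside a longer identifier.
const greedyKeyword = new RegExp("^(?:" + alternation + ")");

// Same, but rejects the match when an identifier character follows.
// A lookahead is used instead of \b because $ is an identifier character
// that \b does not treat as part of a word.
const boundedKeyword = new RegExp("^(?:" + alternation + ")(?![A-Za-z0-9_$])");

// Returns [kind, text] for the first word-like token in src, or null.
function firstToken(src) {
  const kw = boundedKeyword.exec(src);
  if (kw) return ["keyword", kw[0]];
  const id = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(src);
  return id ? ["id", id[0]] : null;
}

console.log(greedyKeyword.exec("constructor")[0]); // "const"
console.log(firstToken("constructor.apply"));      // [ 'id', 'constructor' ]
console.log(firstToken("inherits"));               // [ 'id', 'inherits' ]
console.log(firstToken("in obj"));                 // [ 'keyword', 'in' ]
```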


聽兲甴掵 2024-11-05 16:27:30

Example: The first occurrence of / 2 /i below (the assignment to a) should tokenize as Div, NumericLiteral, Div, Identifier, because it is in an InputElementDiv context. The second occurrence (the assignment to b) should tokenize as RegularExpressionLiteral, because it is in an InputElementRegExp context.

i = 1;
var a = 1 / 2 /i;
console.info(a); // ⇒ 0.5
console.info(typeof a); // ⇒ number

var b = 1 + / 2 /i;
console.info(b); // ⇒ 1/ 2 /i
console.info(typeof b); // ⇒ string

Source:

There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.
Note that contexts exist in the syntactic grammar where both a division and a RegularExpressionLiteral are permitted by the syntactic grammar; however, since the lexical grammar uses the InputElementDiv goal symbol in such cases, the opening slash is not recognised as starting a regular expression literal in such a context. As a workaround, one may enclose the regular expression literal in parentheses.
— Standard ECMA-262 3rd Edition - December 1999, p. 11
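A lexer without a parser usually approximates the two goal symbols by looking at the previous significant token. The following is a rough sketch of that heuristic, not the spec's algorithm and not JsLex's code: DIV_AFTER and lexSlashes are hypothetical names, only a handful of token kinds are recognized, and keywords are lumped in with identifiers, so something like return /x/ would be misread.

```javascript
// Token kinds after which a "/" is treated as division (InputElementDiv).
// Anything else means a "/" may start a regex literal (InputElementRegExp).
const DIV_AFTER = new Set(["id", "number", "regex", ")", "]"]);

// Tiny tokenizer: returns an array of [kind, text] pairs.
function lexSlashes(src) {
  const tokens = [];
  let prev = null; // kind of the previous non-whitespace token
  let i = 0;
  while (i < src.length) {
    const rest = src.slice(i);
    let m = /^\s+/.exec(rest);
    if (m) { i += m[0].length; continue; }
    let kind;
    if ((m = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(rest))) kind = "id";
    else if ((m = /^\d+(?:\.\d+)?/.exec(rest))) kind = "number";
    else if (rest[0] === "/" && !DIV_AFTER.has(prev) &&
             (m = /^\/(?:[^\/\\\n]|\\.)+\/[a-z]*/.exec(rest))) kind = "regex";
    else { m = [rest[0]]; kind = rest[0]; } // single-character punctuator
    tokens.push([kind, m[0]]);
    prev = kind;
    i += m[0].length;
  }
  return tokens;
}

const kinds = (src) => lexSlashes(src).map((t) => t[0]).join(",");
console.log(kinds("a = 1 / 2 /i;"));   // id,=,number,/,number,/,id,;
console.log(kinds("b = 1 + / 2 /i;")); // id,=,number,+,regex,;
```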


傻比既视感 2024-11-05 16:27:30

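This simple solution to a thorny problem is very cool, but I noticed that it doesn't fully handle the ES5 change to the something.property syntax, which allows reserved words after the dot. For example, a.if = 'foo'; (function () {a.if /= 3;}); is valid in some recent implementations.

Unless I'm mistaken, the dot has only one use for properties anyway, so the fix may be to accept only an identifierName token after the dot (identifierName is what identifier uses, except that it doesn't reject reserved words). That would probably do the trick. (Obviously the div state follows as usual.)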
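As a rough illustration of that suggestion, a word that comes right after a dot can be classified as a plain identifier regardless of the reserved-word list. RESERVED and classifyWord below are hypothetical names, and the reserved-word set is a small made-up subset.

```javascript
// Illustrative subset of reserved words.
const RESERVED = new Set(["delete", "function", "if", "in", "return", "this", "var"]);

// Classify a word given the kind of the token that precedes it.
function classifyWord(word, prevKind) {
  // After a dot the word is a property name, so reserved words are allowed.
  if (prevKind === ".") return "id";
  return RESERVED.has(word) ? "keyword" : "id";
}

console.log(classifyWord("if", "."));  // "id"      (as in a.if /= 3)
console.log(classifyWord("if", "{"));  // "keyword"
console.log(classifyWord("foo", ".")); // "id"
```

Because the property name comes out as an identifier, a following / or /= is then lexed as division in the usual way.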


小…红帽 2024-11-05 16:27:30

I've been thinking about the problems of writing a lexer for JavaScript myself, and I just came across your implementation in my search for good techniques. I found a case where yours doesn't work that I thought I'd share if you're still interested:

var g = 3, x = { valueOf: function() { return 6;} } /2/g;

The slashes should both be parsed as division operators, resulting in x being assigned the numeric value 1. Your lexer thinks that it is a regexp. There is no way to handle all variants of this case correctly without maintaining a stack of grouping contexts to distinguish among the end of a block (expect regexp), the end of a function statement (expect regexp), the end of a function expression (expect division), and the end of an object literal (expect division).
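The behaviour described above can be checked by running the snippet in a JavaScript engine. The second half is an added contrast, assuming standard completion-value semantics for eval: when the braces sit at statement position they form an empty block, so the same characters are then read as a regex literal. The names viaObjectLiteral and viaBlock are only for this illustration.

```javascript
// Expression position: the braces are an object literal, so both slashes divide.
var g = 3, x = { valueOf: function() { return 6;} } /2/g;
var viaObjectLiteral = x;
console.log(viaObjectLiteral); // 1   (6 / 2 / 3)

// Statement position: the braces are an empty block, followed by the
// regex literal /2/g as an expression statement.
var viaBlock = (0, eval)("{ } /2/g;");
console.log(viaBlock instanceof RegExp); // true
```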


依靠 2024-11-05 16:27:30

Does it work properly for this code (this shouldn't have a semicolon; it produces an error when lexed properly)?

function square(num) {
    var result;
    var f = function (x) {
        return x * x;
    }
    (result = f(num));
    return result;
}

If it does, does it also work properly for this code, which relies on semicolon insertion?

function square(num) {
    var f = function (x) {
        return x * x;
    }
    return f(num);
}
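For reference, this is how the two snippets behave when actually run, as far as I can tell. In the first, no semicolon is inserted before the parenthesis, so the function expression is called with (result = f(num)) as its argument while f is still undefined. In the second, a semicolon is inserted before return. The names squareNoInsertion and squareWithInsertion are only for this illustration.

```javascript
function squareNoInsertion(num) {
    var result;
    var f = function (x) {
        return x * x;
    }
    (result = f(num)); // parsed as a call on the function expression above
    return result;
}

function squareWithInsertion(num) {
    var f = function (x) {
        return x * x;
    }
    return f(num); // a semicolon is inserted before "return"
}

var caught = null;
try {
    squareNoInsertion(3);
} catch (e) {
    caught = e;
}
console.log(caught instanceof TypeError); // true (f is not a function)
console.log(squareWithInsertion(3));      // 9
```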

About the author: 各空
