How to properly scan for identifiers using Ragel

Posted on 2024-10-20 16:43:52

I'm trying to write a scanner for a C/C++/C#/Java/D-like programming language that I'm designing for personal reasons. For this task I'm using Ragel to generate the scanner. I'm having trouble understanding exactly when many of the operators trigger actions, probably because my academics were focused on practical knowledge rather than theory, and a great deal of this non-deterministic/deterministic finite automata business goes right over my head. Either the documentation is lacking or my understanding of it is. I'm assuming the latter.

In any case, I'm working my way up from the basics. I've identified several keywords and special characters in my first iteration. Now I've run into the issue that all keywords are being scanned as identifiers. I'm using the scanner operator for all of my keywords, as that resolved my earlier issue of the string "returns" being scanned as both the "return" and the "returns" keyword.

How can I properly scan for identifiers? I understand that to make this deterministic, I need to effectively specify that a lexeme can only be an identifier if it matches no other token's pattern. Forgive my lack of knowledge.

Ragel Script:

%%{
    Identifier = (alpha | '_') . (alnum | '_')*;
    action IdentifierAction
    {
        std::cout << "identifier(\"";
        std::cout.write(ts, te - ts);
        std::cout << "\")";
    }
}%%

%%{
    main :=
    |*
        Interface => InterfaceAction;
        Class => ClassAction;
        Property => PropertyAction;
        Function => FunctionAction;
        TypeQualifier => TypeQualifierAction;
        OpenParenthesis => OpenParenthesisAction;
        CloseParenthesis => CloseParenthesisAction;
        OpenBracket => OpenBracketAction;
        CloseBracket => CloseBracketAction;
        OpenBrace => OpenBraceAction;
        CloseBrace => CloseBraceAction;
        Semicolon => SemicolonAction;
        Returns => ReturnsAction;
        Return => ReturnAction;
        Identifier => IdentifierAction;
        space+;
    *|;
}%%

终遇你 2024-10-27 16:43:52

I'm not familiar with Ragel, but I have written a few custom parsers and scanners.

Your question seems to be more about detecting keywords than about detecting generic identifiers.

You have rules telling Ragel to detect when a section of the code is a number, the "return" keyword, a semicolon, the "returns" keyword, an identifier, and so on. Although it's possible to write a rule for each keyword, I wouldn't recommend it.

What I have learned from experience is that it is better to read all keywords as plain identifiers (assign them a general "identifier" token) and then, somewhere in your C/C++ code, detect which identifiers are actually keywords.

In other words, Ragel will detect only identifiers: "myvar", "return" and "returns" will all be marked as identifiers. Later, in the code of your semantic action (C/C++, not Ragel), you check each identifier and determine whether it is one of your language's keywords. This is usually done with a list of keywords.

I think it would look something like this:

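%%{
Identifier = (alpha | '_') . (alnum | '_')*;
action IdentifierAction
{
    // Needs <algorithm>, <iostream>, <iterator> and <string> in the host file.
    // Keyword list for the language; extend it as needed.
    static const char* const Keywords[] = { "return", "if", "else" };

    // The matched lexeme spans [ts, te), so its length is te - ts.
    const std::string MyIdentifier(ts, te - ts);

    // Linear search is fine for a short list; std::string compares by value.
    const bool IsKeyword =
        std::find(std::begin(Keywords), std::end(Keywords), MyIdentifier)
        != std::end(Keywords);

    std::cout << (IsKeyword ? "keyword(\"" : "identifier(\"");
    std::cout.write(ts, te - ts);
    std::cout << "\")";
}
}%%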

So there would be no "Return" or "Returns" rule, just "Identifier".
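
The lookup itself does not depend on Ragel, so it can be pulled out of the action and tested separately. Below is a minimal sketch of that idea, assuming the scanner hands over the usual ts/te pointers. The names isKeyword and describeToken and the keyword list are made up for illustration and are not part of the original script.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical keyword table; replace it with the keywords of your own language.
inline bool isKeyword(const std::string& lexeme) {
    static const std::unordered_set<std::string> keywords = {
        "return", "returns", "if", "else"
    };
    return keywords.count(lexeme) != 0;
}

// Builds the same text the scanner actions print.
// ts and te mirror the token-start and token-end pointers of a Ragel scanner:
// the lexeme is the half-open range [ts, te).
inline std::string describeToken(const char* ts, const char* te) {
    const std::string lexeme(ts, te);
    std::string out = isKeyword(lexeme) ? "keyword(\"" : "identifier(\"";
    out += lexeme;
    out += "\")";
    return out;
}
```

Inside IdentifierAction you would then only need something like std::cout << describeToken(ts, te);. Because the whole lexeme is compared, "returns" can never be mistaken for "return" followed by "s".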
