How to properly scan for identifiers using Ragel
I'm trying to write a scanner for a C/C++/C#/Java/D-like programming language that I'm designing for personal reasons. For this task I'm using Ragel to generate the scanner. I'm having trouble understanding exactly when many of the operators trigger actions, probably because my education focused on practical knowledge rather than theory, and a great deal of this non-deterministic/deterministic finite automata business goes right over my head. Either the documentation is lacking or my understanding of it is. I'm assuming the latter.
In any case, I'm working my way up from the basics. I've identified several keywords and special characters in my first iteration. Now I've run into the issue that all keywords are being scanned as identifiers. I'm using the scanner operator for all of my keywords, as that resolved my earlier issue of the string "returns" being scanned as both the "return" and the "returns" keyword.
How can I properly scan for identifiers? I understand that to make this deterministic, I need to effectively specify that a lexeme can only be an identifier if it matches no other token's pattern. Forgive my lack of knowledge.
Ragel Script:
%%{
    Identifier = (alpha | '_') . (alnum | '_')*;

    action IdentifierAction
    {
        std::cout << "identifier(\"";
        std::cout.write(ts, te - ts);
        std::cout << "\")";
    }
}%%

%%{
    main :=
    |*
        Interface => InterfaceAction;
        Class => ClassAction;
        Property => PropertyAction;
        Function => FunctionAction;
        TypeQualifier => TypeQualifierAction;
        OpenParenthesis => OpenParenthesisAction;
        CloseParenthesis => CloseParenthesisAction;
        OpenBracket => OpenBracketAction;
        CloseBracket => CloseBracketAction;
        OpenBrace => OpenBraceAction;
        CloseBrace => CloseBraceAction;
        Semicolon => SemicolonAction;
        Returns => ReturnsAction;
        Return => ReturnAction;
        Identifier => IdentifierAction;
        space+;
    *|;
}%%
I'm not familiar with Ragel, but I have written some custom parsers and scanners.
Your question seems to be more about detecting keywords than about detecting generic identifiers.
You have rules telling Ragel to detect when a section of the code is a number, the "return" keyword, a semicolon, the "returns" keyword, an identifier, and so on. Although it's possible to make a rule for each keyword, I wouldn't recommend it.
What I have learned from experience is that it's better to read all keywords as plain identifiers (assign them a general "identifier" token) and then, somewhere in your C/C++ code, detect which identifiers are keywords.
In other words, Ragel will detect only identifiers: "myvar", "return" and "returns" will all be marked as identifiers. Later, in the code of your semantic action (C/C++, not Ragel), you check each identifier and determine whether it is a keyword. This is usually done with a list of keywords.
So there would be no "Return" or "Returns" rule in the Ragel script, just "Identifier".