What is the best way to handle overlapping lexer patterns that are sensitive to context?

Posted 2025-01-10 02:22:52

I'm attempting to write an Antlr grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).

For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:

workspace "Name" "Description" {
    properties {
        xyz "a string property"
        nonstring nodoublequotes
    }
}

The issue I'm running into is that the rules for <name> and <value> have to be very broad: basically anything except whitespace. In addition, properties that contain spaces are written in double quotes and will match my STRING token.

My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.

grammar c4mce;

workspace : 'workspace' (STRING (STRING)?)?  '{' NL workspace_body '}';

workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';

property_element: BLOB property_value;
property_value : BLOB | STRING;

BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;

This tokenizes to

[@0,0:8='workspace',<'workspace'>,1:0]
[@1,10:15='"Name"',<STRING>,1:10]
[@2,17:29='"Description"',<STRING>,1:17]
[@3,31:31='{',<'{'>,1:31]
[@4,32:32='\n',<NL>,1:32]
[@5,37:46='properties',<'properties'>,2:4]
[@6,48:48='{',<'{'>,2:15]
[@7,49:49='\n',<NL>,2:16]
[@8,58:60='xyz',<BLOB>,3:8]
[@9,62:80='"a string property"',<STRING>,3:12]
[@10,81:81='\n',<NL>,3:31]
[@11,90:98='nonstring',<BLOB>,4:8]
[@12,100:113='nodoublequotes',<BLOB>,4:18]
[@13,114:114='\n',<NL>,4:32]
[@14,119:119='}',<'}'>,5:4]
[@15,120:120='\n',<NL>,5:5]
[@16,121:121='}',<'}'>,6:0]
[@17,122:122='\n',<NL>,6:1]
[@18,123:122='<EOF>',<EOF>,7:0]

This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this? As I expand the grammar, I expect to end up with a lot of BLOB tokens, simply because defining a narrower token in the lexer would be pointless: BLOB would match instead.
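
For context, the keyword clash described above follows from how ANTLR 4 resolves overlapping lexer rules: the rule that matches the longest input wins, and on a tie the rule defined first wins. Below is a minimal sketch of what the case-insensitive keyword tokens might look like. The name K_PROPERTIES and both rule bodies are assumptions, since the actual definitions are not shown in the question.

```antlr
// Assumed keyword rules; only the name K_WORKSPACE appears in the question.
K_WORKSPACE  : [wW][oO][rR][kK][sS][pP][aA][cC][eE] ;
K_PROPERTIES : [pP][rR][oO][pP][eE][rR][tT][iI][eE][sS] ;

// Defined after the keywords, so it loses any tie of equal length.
BLOB : [\p{Alpha}]+ ;
```

With rules like these, the input workspace matches both K_WORKSPACE and BLOB at the same length, so the lexer emits K_WORKSPACE, while a longer word such as workspaces is a longer match for BLOB and stays a BLOB. The quoted literals in the posted grammar behave the same way, as the token dump shows: 'workspace' and 'properties' come out as their own token types rather than as BLOB.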

Comments (1)

゛时过境迁 2025-01-17 02:22:52

This is the classic keywords-as-identifiers problem. If a character sequence that is lexed as a keyword should also be usable as an ordinary identifier in certain places, you have to list that keyword token as an explicit alternative in those parser rules. For example:

property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
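
One way to keep this manageable as the keyword list grows is to collect the alternatives in a single helper rule. The sketch below assumes a hypothetical rule name identifier and a second keyword token K_PROPERTIES, neither of which appears in the original grammar.

```antlr
// Hypothetical helper rule: every keyword that may also be used as a plain name.
identifier
    : BLOB
    | K_WORKSPACE
    | K_PROPERTIES   // assumed token name; add further keywords here
    ;

property_element : identifier property_value ;
property_value   : identifier | STRING ;
```

Each new keyword token then only has to be added to identifier once, rather than to every parser rule that accepts a name or a value.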