What is the best way to handle context-sensitive, overlapping lexer patterns?
I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open-ended, resulting in overlapping lexer rules (in the sense that multiple token rules match the same input).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
    properties {
        xyz "a string property"
        nonstring nodoublequotes
    }
}
The issue I'm running into is that the rules for <name> and <value> have to be very broad, basically anything except whitespace. Also, properties that contain spaces are written in double quotes, and those match my STRING token.
My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to:
[@0,0:8='workspace',<'workspace'>,1:0]
[@1,10:15='"Name"',<STRING>,1:10]
[@2,17:29='"Description"',<STRING>,1:17]
[@3,31:31='{',<'{'>,1:31]
[@4,32:32='\n',<NL>,1:32]
[@5,37:46='properties',<'properties'>,2:4]
[@6,48:48='{',<'{'>,2:15]
[@7,49:49='\n',<NL>,2:16]
[@8,58:60='xyz',<BLOB>,3:8]
[@9,62:80='"a string property"',<STRING>,3:12]
[@10,81:81='\n',<NL>,3:31]
[@11,90:98='nonstring',<BLOB>,4:8]
[@12,100:113='nodoublequotes',<BLOB>,4:18]
[@13,114:114='\n',<NL>,4:32]
[@14,119:119='}',<'}'>,5:4]
[@15,120:120='\n',<NL>,5:5]
[@16,121:121='}',<'}'>,6:0]
[@17,122:122='\n',<NL>,6:1]
[@18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens, simply because creating a narrower token in the lexer would be pointless: BLOB would match instead.
Comments (1)
This is the classic keywords-as-identifiers problem. If a specific character sequence is lexed as a keyword but should also be usable as a normal identifier in certain places, you have to list that keyword as a possible alternative wherever an identifier is accepted. For example:
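A minimal sketch of that idea applied to the grammar from the question. The parser rule name and the K_PROPERTIES token are assumptions made for illustration (only K_WORKSPACE is mentioned in the question), and the case-insensitive keywords are spelled out with character sets:

```antlr
grammar c4mce;

workspace         : K_WORKSPACE (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body    : (workspace_element NL)* ;
workspace_element : K_PROPERTIES '{' NL (property_element NL)* '}';
property_element  : name property_value;
property_value    : name | STRING;

// Hypothetical helper rule: wherever a free-form word is allowed,
// accept BLOB or any keyword token that the lexer may have produced instead.
name : BLOB | K_WORKSPACE | K_PROPERTIES ;

// Keywords are defined before BLOB, so they win when both match
// the same text with the same length.
K_WORKSPACE  : [wW][oO][rR][kK][sS][pP][aA][cC][eE] ;
K_PROPERTIES : [pP][rR][oO][pP][eE][rR][tT][iI][eE][sS] ;

BLOB   : [\p{Alpha}]+ ;
STRING : '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"' ;
NL     : '\r'? '\n' ;
WS     : [ \t]+ -> skip ;
```

With this arrangement the lexer stays context-free: a property line such as workspace properties still lexes as K_WORKSPACE K_PROPERTIES, and the parser accepts both tokens through name. The cost is that every new keyword has to be added to name as well.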