为 Antlr3 语法添加带引号的字符串支持

发布于 2024-12-28 10:47:24 字数 1125 浏览 2 评论 0原文

我正在尝试实现一种用于解析查询的语法。单个查询由 items 组成,其中每个项目可以是 namename-ref

namemystring (只有字母,没有空格)或 "my long string" (字母和空格,始终用引号引起来)。 name-refname 非常相似,唯一的区别是它应该以 ref: 开头 (ref:mystring,<代码>参考:“我的长字符串”)。查询应至少包含 1 项(namename-ref)。

这就是我所拥有的:

NAME: ('a'..'z')+;
REF_TAG: 'ref:';
SP: ' '+;

name: NAME;
name_ref: REF_TAG name;
item: name | name_ref;
query: item (SP item)*;

此语法演示了我基本上需要获得的内容,唯一的功能是它不支持长引用字符串(它适用于没有空格的名称)。

SHORT_NAME: ('a'..'z')+;
LONG_NAME: SHORT_NAME (SP SHORT_NAME)*;
REF_TAG: 'ref:';
SP: ' '+;
Q: '"';

short_name: SHORT_NAME;
long_name: LONG_NAME;
name_ref: REF_TAG (short_name | (Q long_name Q));
item: (short_name | (Q long_name Q)) | name_ref;
query: item (SP item)*;

但这是行不通的。有什么想法有什么问题吗?也许,这很重要:我的第一个查询应该被视为 3 item(3 name)和“我的第一个查询” 是 1 item (1 long_name)。

I'm trying to implement a grammar for parsing queries. Single query consists of items where each item can be either name or name-ref.

name is either mystring (only letters, no spaces) or "my long string" (letters and spaces, always quoted). name-ref is very similar to name and the only difference is that it should start with ref: (ref:mystring, ref:"my long string"). Query should contain at least 1 item (name or name-ref).

Here's what I have:

NAME: ('a'..'z')+;
REF_TAG: 'ref:';
SP: ' '+;

name: NAME;
name_ref: REF_TAG name;
item: name | name_ref;
query: item (SP item)*;

This grammar demonstrates what I basically need to get and the only feature is that it doesn't support long quoted strings (it works fine for names that doesn't have spaces).

SHORT_NAME: ('a'..'z')+;
LONG_NAME: SHORT_NAME (SP SHORT_NAME)*;
REF_TAG: 'ref:';
SP: ' '+;
Q: '"';

short_name: SHORT_NAME;
long_name: LONG_NAME;
name_ref: REF_TAG (short_name | (Q long_name Q));
item: (short_name | (Q long_name Q)) | name_ref;
query: item (SP item)*;

But that doesn't work. Any ideas what's the problem? Probably, that's important: my first query should be treated as 3 items (3 names) and "my first query" is 1 item (1 long_name).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

好久不见√ 2025-01-04 10:47:26

ANTLR 的词法分析器贪婪地匹配:这就是为什么像我的第一个查询这样的输入被标记为LONG_NAME,而不是3个之间有空格的SHORT_NAME

只需删除 LONG_NAME 规则并在解析器规则 long_name 中定义它即可。

以下语法:

SHORT_NAME : ('a'..'z')+;
REF_TAG    : 'ref:';
SP         : ' '+;
Q          : '"';

short_name : SHORT_NAME;
long_name  : Q SHORT_NAME (SP SHORT_NAME)* Q;
name_ref   : REF_TAG (short_name | (Q long_name Q));
item       : short_name | long_name | name_ref;
query      : item (SP item)*;

将解析输入:

my first query "my first query" ref:mystring

如下:

在此处输入图像描述

但是,您可以 还在词法分析器中对带引号的名称进行标记,并使用一些自定义代码从中删除引号。从词法分析器中删除空格也是一种选择。像这样的东西:

SHORT_NAME : ('a'..'z')+;
LONG_NAME  : '"' ~'"'* '"' {setText(getText().substring(1, getText().length()-1));};
REF_TAG    : 'ref:';
SP         : ' '+ {skip();};

name_ref   : REF_TAG (SHORT_NAME | LONG_NAME);
item       : SHORT_NAME | LONG_NAME | name_ref;
query      : item+ EOF;

它将解析相同的输入,如下所示:

在此处输入图像描述

请注意,实际令牌 LONG_NAME 将被去除其开始和结束引号。

ANTLR's lexer matches greedily: that is why input like my first query is being tokenized as LONG_NAME instead of 3 SHORT_NAMEs with spaces in between.

Simply remove the LONG_NAME rule and define it in the parser rule long_name.

The following grammar:

SHORT_NAME : ('a'..'z')+;
REF_TAG    : 'ref:';
SP         : ' '+;
Q          : '"';

short_name : SHORT_NAME;
long_name  : Q SHORT_NAME (SP SHORT_NAME)* Q;
name_ref   : REF_TAG (short_name | (Q long_name Q));
item       : short_name | long_name | name_ref;
query      : item (SP item)*;

will parse the input:

my first query "my first query" ref:mystring

as follows:

enter image description here

However, you could also tokenize a quoted name in the lexer and strip the quotes from it with a bit of custom code. And removing spaces from the lexer could also be an option. Something like this:

SHORT_NAME : ('a'..'z')+;
LONG_NAME  : '"' ~'"'* '"' {setText(getText().substring(1, getText().length()-1));};
REF_TAG    : 'ref:';
SP         : ' '+ {skip();};

name_ref   : REF_TAG (SHORT_NAME | LONG_NAME);
item       : SHORT_NAME | LONG_NAME | name_ref;
query      : item+ EOF;

which would parse the same input as follows:

enter image description here

Note that the actual token LONG_NAME will be stripped of its start- and end-quote.

无力看清 2025-01-04 10:47:26

这是一个应该满足您的要求的语法:

  SP: ' '+;
  SHORT_NAME: ('a'..'z')+;
  LONG_NAME: '"' SHORT_NAME (SP SHORT_NAME)* '"';
  REF: 'ref:' (SHORT_NAME | LONG_NAME);

  item: SHORT_NAME | LONG_NAME  | REF;
  query: item (SP item)*;

如果您将其放在顶部:

  grammar Query;

  @members {
      public static void main(String[] args) throws Exception {
          QueryLexer lex = new QueryLexer(new ANTLRFileStream(args[0]));
           CommonTokenStream tokens = new CommonTokenStream(lex);

          QueryParser parser = new QueryParser(tokens);

          try {
              TokenSource ts = parser.getTokenStream().getTokenSource();
              Token tok = ts.nextToken();
              while (EOF != (tok.getType())) {
                 System.out.println("Got a token: " + tok);
                 tok = ts.nextToken();
              }
          } catch (Exception e) {
              e.printStackTrace();
           }
      }
  }

您应该看到词法分析器很好地分解了所有内容(我希望;-))

hi there "long name" ref:shortname ref:"long name"

应该给出:

Got a token: [@-1,0:1='hi',<6>,1:0]
Got a token: [@-1,2:2=' ',<7>,1:2]
Got a token: [@-1,3:7='there',<6>,1:3]
Got a token: [@-1,8:8=' ',<7>,1:8]
Got a token: [@-1,9:19='"long name"',<4>,1:9]
Got a token: [@-1,20:20=' ',<7>,1:20]
Got a token: [@-1,21:33='ref:shortname',<5>,1:21]
Got a token: [@-1,34:34=' ',<7>,1:34]
Got a token: [@-1,35:49='ref:"long name"',<5>,1:35]

我不是 100% 确定您的问题是什么语法,但我怀疑问题与您对不带引号的 LONG_NAME 的定义有关。或许你能明白其中的区别是什么?

Here's a grammar that should work for your requirements:

  SP: ' '+;
  SHORT_NAME: ('a'..'z')+;
  LONG_NAME: '"' SHORT_NAME (SP SHORT_NAME)* '"';
  REF: 'ref:' (SHORT_NAME | LONG_NAME);

  item: SHORT_NAME | LONG_NAME  | REF;
  query: item (SP item)*;

If you put this at the top:

  grammar Query;

  @members {
      public static void main(String[] args) throws Exception {
          QueryLexer lex = new QueryLexer(new ANTLRFileStream(args[0]));
           CommonTokenStream tokens = new CommonTokenStream(lex);

          QueryParser parser = new QueryParser(tokens);

          try {
              TokenSource ts = parser.getTokenStream().getTokenSource();
              Token tok = ts.nextToken();
              while (EOF != (tok.getType())) {
                 System.out.println("Got a token: " + tok);
                 tok = ts.nextToken();
              }
          } catch (Exception e) {
              e.printStackTrace();
           }
      }
  }

You should see the lexer break everything apart nicely (I hope ;-) )

hi there "long name" ref:shortname ref:"long name"

Should give:

Got a token: [@-1,0:1='hi',<6>,1:0]
Got a token: [@-1,2:2=' ',<7>,1:2]
Got a token: [@-1,3:7='there',<6>,1:3]
Got a token: [@-1,8:8=' ',<7>,1:8]
Got a token: [@-1,9:19='"long name"',<4>,1:9]
Got a token: [@-1,20:20=' ',<7>,1:20]
Got a token: [@-1,21:33='ref:shortname',<5>,1:21]
Got a token: [@-1,34:34=' ',<7>,1:34]
Got a token: [@-1,35:49='ref:"long name"',<5>,1:35]

I'm not 100% sure what the problem is with your grammar, but I suspect the issue relates to your definition of a LONG_NAME without the quotes. Perhaps you can see what the distinction is?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文