使用antlr解析|分离的文件

发布于 2024-08-19 15:07:55 字数 605 浏览 5 评论 0原文

所以我认为这应该很容易,但我遇到了困难。我正在尝试解析 |分隔文件以及任何不以 | 开头的行是一条评论。我想我不明白评论是如何运作的。它总是在注释行出错。这是旧文件,因此无需更改。这是我的语法。

grammar Route;

@header {
package org.benheath.codegeneration;
}

@lexer::header {
package org.benheath.codegeneration;
}

file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;

COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace

数据:

! a comment
Another comment
| a | abc | b | def | ...

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.

grammar Route;

@header {
package org.benheath.codegeneration;
}

@lexer::header {
package org.benheath.codegeneration;
}

file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;

COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace

Data:

! a comment
Another comment
| a | abc | b | def | ...

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

只怪假的太真实 2024-08-26 15:07:55

其语法如下所示:

parse
  :  line* EOF   
  ;

line
  :  ( comment | values ) ( NL | EOF )
  ;

comment
  :  ELEMENT+
  ;

values
  :  PIPE ( ELEMENT PIPE )+
  ;

PIPE
  :  '|'    
  ;

ELEMENT
  :  ('a'..'z')+
  ;

NL
  :  '\r'? '\n' |  '\r' 
  ;

WS
  :  (' '|'\t') {$channel=HIDDEN;} 
  ;

要测试它,您只需在语法中添加一些代码,如下所示:

grammar Route;

@members {
  List<List<String>> values = new ArrayList<List<String>>();
}

parse
  :  line* EOF   
  ;

line
  :  ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
  ;

comment
  :  ELEMENT+
  ;

values returns [List<String> line]
@init {line = new ArrayList<String>();}
  :  PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
  ;

PIPE
  :  '|'    
  ;

ELEMENT
  :  ('a'..'z')+
  ;

NL
  :  '\r'? '\n' |  '\r' 
  ;

WS
  :  (' '|'\t') {$channel=HIDDEN;} 
  ;

现在通过调用生成词法分析器/解析器:

java -cp antlr-3.2.jar org.antlr.Tool Route.g

创建一个类 RouteTest.java

import org.antlr.runtime.*;
import java.util.List;

public class RouteTest {
  public static void main(String[] args) throws Exception {
    String data = 
        "a comment\n"+
        "| xxxxx | y |     zzz   |\n"+
        "another comment\n"+
        "| a | abc | b | def |";
    ANTLRStringStream in = new ANTLRStringStream(data);
    RouteLexer lexer = new RouteLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RouteParser parser = new RouteParser(tokens);
    parser.parse();
    for(List<String> line : parser.values) {
      System.out.println(line);
    }
  }
}

编译所有源文件:

javac -cp antlr-3.2.jar *.java

并运行类 RouteTest

// Windows
java -cp .;antlr-3.2.jar RouteTest

// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest

如果一切顺利,您会看到打印到控制台:

[xxxxx, y, zzz]
[a, abc, b, def]

编辑:请注意,我通过只允许小写字母来简化它,您当然可以随时扩展集合。

A grammar for that would look like this:

parse
  :  line* EOF   
  ;

line
  :  ( comment | values ) ( NL | EOF )
  ;

comment
  :  ELEMENT+
  ;

values
  :  PIPE ( ELEMENT PIPE )+
  ;

PIPE
  :  '|'    
  ;

ELEMENT
  :  ('a'..'z')+
  ;

NL
  :  '\r'? '\n' |  '\r' 
  ;

WS
  :  (' '|'\t') {$channel=HIDDEN;} 
  ;

And to test it, you just need to sprinkle a bit of code in your grammar like this:

grammar Route;

@members {
  List<List<String>> values = new ArrayList<List<String>>();
}

parse
  :  line* EOF   
  ;

line
  :  ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
  ;

comment
  :  ELEMENT+
  ;

values returns [List<String> line]
@init {line = new ArrayList<String>();}
  :  PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
  ;

PIPE
  :  '|'    
  ;

ELEMENT
  :  ('a'..'z')+
  ;

NL
  :  '\r'? '\n' |  '\r' 
  ;

WS
  :  (' '|'\t') {$channel=HIDDEN;} 
  ;

Now generate a lexer/parser by invoking:

java -cp antlr-3.2.jar org.antlr.Tool Route.g

create a class RouteTest.java:

import org.antlr.runtime.*;
import java.util.List;

public class RouteTest {
  public static void main(String[] args) throws Exception {
    String data = 
        "a comment\n"+
        "| xxxxx | y |     zzz   |\n"+
        "another comment\n"+
        "| a | abc | b | def |";
    ANTLRStringStream in = new ANTLRStringStream(data);
    RouteLexer lexer = new RouteLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RouteParser parser = new RouteParser(tokens);
    parser.parse();
    for(List<String> line : parser.values) {
      System.out.println(line);
    }
  }
}

Compile all source files:

javac -cp antlr-3.2.jar *.java

and run the class RouteTest:

// Windows
java -cp .;antlr-3.2.jar RouteTest

// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest

If all goes well, you see this printed to your console:

[xxxxx, y, zzz]
[a, abc, b, def]

Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.

撑一把青伞 2024-08-26 15:07:55

使用 ANTLR 来完成这样的工作是一个好主意,尽管我确实认为这是多余的。例如,这将非常容易(在伪代码中):

for each line in file:
if line begins with '|':
    fields = /|\s*([a-z]+)\s*/g

编辑:好吧,您无法词汇表达注释和行之间的区别,因为没有任何词汇可以区分它们。一个让你朝着一个可行的方向前进的提示。

line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;

It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):

for each line in file:
if line begins with '|':
    fields = /|\s*([a-z]+)\s*/g

Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.

line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
み零 2024-08-26 15:07:55

这似乎有效,我发誓我尝试过。将注释更改为小写将其切换到解析器与词法分析器,我仍然不明白。

grammar Route;

@header {
    package org.benheath.codegeneration;
}

@lexer::header {
    package org.benheath.codegeneration;
}

file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;

comment : ~'|' .* '\n';

ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace

This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.

grammar Route;

@header {
    package org.benheath.codegeneration;
}

@lexer::header {
    package org.benheath.codegeneration;
}

file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;

comment : ~'|' .* '\n';

ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文