帮助解析日志文件 (ANTLR3)

发布于 2024-09-01 10:08:36 字数 1214 浏览 12 评论 0原文

我需要一些指导来编写语法来解析游戏《永恒之塔》的日志文件。我决定使用 Antlr3（因为它似乎是一个可以完成这项工作的工具，并且我认为学习使用它对我有好处）。但是，我遇到了问题，因为日志文件的结构不准确。

我需要解析的日志文件如下所示：

2010.04.27 22:32:22 : You changed the connection status to Online. 
2010.04.27 22:32:22 : You changed the group to the Solo state. 
2010.04.27 22:32:22 : You changed the group to the Solo state. 
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)



ventrillo: 19x.xxx.xxx.xxx

Port: 3712

Pass: xxxx (blabla) 

 4/27/2010 7:47 PM 
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.

如您所见，大多数行都以时间戳开头，但也有例外。我想在 Antlr3 中做的是编写一个解析器，它仅使用以时间戳开头的行，同时默默地丢弃其他行。

这就是我到目前为止所写的内容（我是这些东西的初学者，所以请不要笑：D）

grammar Antlr;

options {
  language = Java;
}

logfile: line* EOF;

line : dataline | textline;

dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;

timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;

four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;

text: ~NL+;

/* Whitespace */ 
WS: (' ' | '\t')+;

/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;

/* Digits */
DIG : '0'..'9';

所以我需要的是一个示例，说明如何解析它而不为没有时间戳的行生成错误。

谢谢！

原文

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.

The log file I need to parse looks like the one below:

2010.04.27 22:32:22 : You changed the connection status to Online. 
2010.04.27 22:32:22 : You changed the group to the Solo state. 
2010.04.27 22:32:22 : You changed the group to the Solo state. 
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)



ventrillo: 19x.xxx.xxx.xxx

Port: 3712

Pass: xxxx (blabla) 

 4/27/2010 7:47 PM 
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.

As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.

This is what I've written so far (I'm a beginner with these things so please don't laugh :D)

grammar Antlr;

options {
  language = Java;
}

logfile: line* EOF;

line : dataline | textline;

dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;

timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;

four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;

text: ~NL+;

/* Whitespace */ 
WS: (' ' | '\t')+;

/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;

/* Digits */
DIG : '0'..'9';

So what I need is an example of how to parse this without generating errors for lines without the timestamp.

Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

泡沫很甜 2024-09-08 10:08:36

没有人会笑。事实上，你的第一次尝试就做得很好了。当然，还有改进的空间！ :)

首先一些说明：您只能否定单个字符。由于您的 NL 规则可能包含两个字符，因此您无法否定它。另外，当从解析器规则中进行否定时，您不会否定单个字符，而是否定词法分析器规则。这听起来可能有点令人困惑，所以让我用一个例子来澄清。采用组合（解析器和词法分析器）语法 T：

grammar T;

// parser rule
foo
  :  ~A
  ;

// lexer rules
A
  :  'a'
  ;

B
  :  'b'
  ;

C
  :  'c'
  ;

如您所见，我否定了 foo 中的 A 词法分析器规则解析器规则。 foo 规则现在不匹配除 'a' 之外的任何字符，但它匹配除 A 之外的任何词法分析器规则>。换句话说，它只会匹配 'b' 或 'c' 字符。

另外，您不需要

options {
  language = Java;
}

在语法中添加：：默认目标是Java（当然将其保留在那里也没有什么坏处）。

现在，在您的语法中，您已经可以区分词法分析器语法中的 data 行和 text 行。这是一种可能的方法：

logfile
  :  line+
  ;

line
  :  dataline 
  |  textline
  ;

dataline
  :  DataLine
  ;

textline
  :  TextLine
  ;

DataLine
  :  TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
  ;

TextLine
  :  ~('\r' | '\n')* (NewLine | EOF)
  ;

fragment
NewLine
  :  '\r'? '\n'
  |  '\r'
  ;

fragment
TwoDigits
  :  '0'..'9' '0'..'9'
  ;

fragment
Space
  :  ' ' 
  |  '\t'
  ;

请注意，词法分析器规则中的 fragment 部分意味着不会从这些规则创建任何标记：它们仅在其他词法分析器规则中使用。因此词法分析器只会创建两种不同类型的标记：DataLine 和 TextLine。

No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)

First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:

grammar T;

// parser rule
foo
  :  ~A
  ;

// lexer rules
A
  :  'a'
  ;

B
  :  'b'
  ;

C
  :  'c'
  ;

As you can see, I'm negating the A lexer-rule in the foo parser-rule. The foo rule does now not match any character except the 'a', but it matches any lexer rule except A. In other words, it will only match a 'b' or 'c' character.

Also, you don't need to put:

options {
  language = Java;
}

in your grammar: the default target is Java (it does not hurt to leave it in there of course).

Now, in your grammar, you can already make a distinction between data- and text-lines in your lexer grammar. Here's a possible way to do so:

logfile
  :  line+
  ;

line
  :  dataline 
  |  textline
  ;

dataline
  :  DataLine
  ;

textline
  :  TextLine
  ;

DataLine
  :  TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
  ;

TextLine
  :  ~('\r' | '\n')* (NewLine | EOF)
  ;

fragment
NewLine
  :  '\r'? '\n'
  |  '\r'
  ;

fragment
TwoDigits
  :  '0'..'9' '0'..'9'
  ;

fragment
Space
  :  ' ' 
  |  '\t'
  ;

Note that the fragment part in the lexer rules mean that no tokens are being created from those rules: they are only used in other lexer rules. So the lexer will only create two different type of tokens: DataLine's and TextLine's.

回复收藏 0 原文

霞映澄塘 2024-09-08 10:08:36

为了使您的语法尽可能接近，以下是我如何根据示例输入使其工作的方法。由于空格从词法分析器传递到解析器，因此我确实将所有标记从解析器移至实际的词法分析器规则中。主要的变化实际上只是添加另一个行选项，然后尝试让它匹配您的测试数据而不是实际的其他好数据，我还假设应该丢弃空白行，正如您可以从规则中看出的那样。所以这就是我能够得到的工作：

logfile: line* EOF;

//line : dataline | textline;
line : dataline | textline | discardline;

dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;

//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
    | (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;

timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;

four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;

//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;

/* Whitespace */ 
WS: (' ' | '\t')+ ;

/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;

/* Digits */
DIG : '0'..'9'; 

//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';

希望这对你有帮助，祝你好运。

Trying to keep your grammar as close as possible, here is how I was able to get it to work based on the example input. Because whitespace is being passed to the parser from the lexer, I did move all your tokens from the parser into actual lexer rules. The main change is really just adding another line option and then trying to get it to match your test data and not the actual other good data, I also assumed that a blank line should be discarded as you can tell by the rule. So here is what I was able to get working:

logfile: line* EOF;

//line : dataline | textline;
line : dataline | textline | discardline;

dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;

//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
    | (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;

timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;

four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;

//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;

/* Whitespace */ 
WS: (' ' | '\t')+ ;

/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;

/* Digits */
DIG : '0'..'9'; 

//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';

Hopefully that helps you, good luck.

回复收藏 0 原文

~没有更多了~