ANTLR: get and split lexer content

Posted on 2024-11-05 23:28:45

First, sorry about my English; I'm still learning.

I'm writing a Python module for my framework that parses CSS files. I tried regex and PLY (a Python lexer and parser), but ended up with ANTLR.

As a first step, I need to parse comments from a CSS file.
This is the CSS string I want to parse:

/*test*/

/*
test1
*/

/*test2/*nested*/comment*/

I know that CSS doesn't allow nested comments, but I need them in my framework. I wrote a simple ANTLR grammar:

grammar CSS;

options {
    language = Python;
}

styleSheet
    : comments EOF ;

comments
    : NESTED_ML_COMMENT*
    ;

NESTED_ML_COMMENT
    :   '/*' 
        (options {greedy=false;} : (NESTED_ML_COMMENT | . ) )* 
        '*/' 
    ;

LINEBREAK 
    :  ('\n\r' | '\n')+{$channel=HIDDEN; };

What I get as a result is:

[image not preserved]

What I expect (paint work :D):

[image not preserved]

Notice that I don't want /* and */ in the result.

Is there any way to do this in pure ANTLR? I have no problem with using Python in ANTLR, but if there is a way to do this without Python, I would be grateful.

归途 2024-11-12 23:28:45

No, there is no easy way. Since NESTED_ML_COMMENT is a lexer rule (a "simple" token), a parser rule cannot create any further structure inside source like /*test2/*nested*/comment*/: a lexer rule always produces a "flat" sequence of characters. Sure, there are (easy) ways to rewrite this character sequence (i.e. remove /* and */), but creating parent-sibling hierarchies is not possible.
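To make the "flat" rewrite concrete, here is a minimal pure-Python sketch. It is not ANTLR API, and the helper name `strip_comment_delimiters` is hypothetical. It removes only the outermost /* and */ from a token's text:

```python
def strip_comment_delimiters(token_text):
    """Return the text between the outermost /* and */ of a comment token.

    Hypothetical helper for illustration; text that is not wrapped in
    /* ... */ is returned unchanged.
    """
    wrapped = (len(token_text) >= 4
               and token_text.startswith('/*')
               and token_text.endswith('*/'))
    return token_text[2:-2] if wrapped else token_text


if __name__ == '__main__':
    print(strip_comment_delimiters('/*test2/*nested*/comment*/'))
```

For the nested input this yields test2/*nested*/comment: the inner delimiters are still plain characters, so no hierarchy is produced.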

In order to create a hierarchy like the one in your second image, you will have to "promote" your comment rule to the parser (so make it into a parser rule). In that case, your lexer would have the rules COMMENT_START : '/*'; and COMMENT_END : '*/';. But that opens a can of worms: inside your lexer you would now also need to account for all characters that can come between /* and */.

You could create another parser that parses (nested) comments and use that inside your CSS grammar. Inside your CSS grammar, you simply keep it as it is, and your second parser is a dedicated comments-parser that creates a hierarchy from the comment-tokens.

A quick demo. The grammar:

grammar T;

parse
  :  comment EOF 
  ;

comment
  :  COMMENT_START (ANY | comment)* COMMENT_END
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
ANY           :  . ;

will parse the source /*test2/*nested*/comment*/ into the following parse tree:

[image not preserved]

which you can rewrite so that /* and */ are removed, of course.
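For readers without ANTLR at hand, the same recursive idea can be sketched in plain Python. This is an illustration only, not code generated from grammar T: the function name `parse_comment` and the nested-list tree shape are assumptions, and consecutive characters are joined into one string instead of one node per character.

```python
def parse_comment(src, pos=0):
    """Parse one (possibly nested) comment starting at src[pos].

    Returns (tree, next_pos). The tree is a list whose items are text
    strings or nested lists (one list per nested comment); the /* and */
    delimiters are dropped.
    """
    if not src.startswith('/*', pos):
        raise ValueError('expected /* at offset %d' % pos)
    pos += 2
    children = []
    text = []
    while pos < len(src):
        if src.startswith('*/', pos):
            # End of this comment: flush pending text and return.
            if text:
                children.append(''.join(text))
            return children, pos + 2
        if src.startswith('/*', pos):
            # Start of a nested comment: flush pending text, then recurse.
            if text:
                children.append(''.join(text))
                text = []
            child, pos = parse_comment(src, pos)
            children.append(child)
        else:
            text.append(src[pos])
            pos += 1
    raise ValueError('unterminated comment')


if __name__ == '__main__':
    tree, _ = parse_comment('/*test2/*nested*/comment*/')
    print(tree)
```

Run on /*test2/*nested*/comment*/, the sketch returns the tree ['test2', ['nested'], 'comment'].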

Inside your CSS grammar, you then do:

comment
  :  NESTED_ML_COMMENT 
     {
       text = $NESTED_ML_COMMENT.text
       # invoke the TParser (my demo grammar) on `text`
     }
  ;

EDIT

Note that ANTLRWorks creates its own internal parse tree to which you have no access. If you do not tell ANTLR to generate a proper AST, you will just end up with a flat list of tokens (even though ANTLRWorks suggests it is some sort of tree).

Here's a previous Q&A that explains how to create a proper AST: How to output the AST built using ANTLR?

Now let's get back to the "comment" grammar I posted above. I'll rename the ANY rule to TEXT. At the moment, this rule only matches a single character at a time. But it's more convenient to let it match all the way up to the next /* or */. This can be done by introducing a plain Python method in the lexer class that performs this check. Inside the TEXT rule, we'll use that method inside a predicate so that * gets matched if it's not directly followed by a /, and a / gets matched if it's not directly followed by a *:

grammar Comment;

options {
  output=AST;
  language=Python;
}

tokens {
  COMMENT;
}

@lexer::members {
  def not_part_of_comment(self):
    current = self.input.LA(1)
    next = self.input.LA(2)
    if current == ord('*'): return next != ord('/')
    if current == ord('/'): return next != ord('*')  
    return True
}

parse
  :  comment EOF -> comment
  ;

comment
  :  COMMENT_START atom* COMMENT_END -> ^(COMMENT atom*)
  ;

atom
  :  TEXT
  |  comment
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
TEXT          : ({self.not_part_of_comment()}?=> . )+ ;

Find out more about the predicate syntax, { boolean_expression }?=>, in this Q&A: What is a 'semantic predicate' in ANTLR?
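The lookahead check performed by the predicate can also be tried outside ANTLR. The following is a rough pure-Python sketch of the same tokenization under the assumption that the input is a plain string; the `tokenize` function and the tuple token format are hypothetical, only the token names are taken from the grammar above.

```python
def not_part_of_comment(src, pos):
    """True if src[pos] does not start a /* or */ delimiter."""
    current = src[pos]
    following = src[pos + 1] if pos + 1 < len(src) else ''
    if current == '*':
        return following != '/'
    if current == '/':
        return following != '*'
    return True


def tokenize(src):
    """Split src into COMMENT_START, COMMENT_END and TEXT tokens."""
    tokens = []
    pos = 0
    while pos < len(src):
        if src.startswith('/*', pos):
            tokens.append(('COMMENT_START', '/*'))
            pos += 2
        elif src.startswith('*/', pos):
            tokens.append(('COMMENT_END', '*/'))
            pos += 2
        else:
            # Consume characters up to the next delimiter, like TEXT does.
            start = pos
            while pos < len(src) and not_part_of_comment(src, pos):
                pos += 1
            tokens.append(('TEXT', src[start:pos]))
    return tokens


if __name__ == '__main__':
    for token in tokenize('/*a*b/*c*/d/e*/'):
        print(token)
```

Note that a lone * or / (as in a*b or d/e) stays inside a TEXT token, because the check only rejects a character when the one after it completes a delimiter.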

To test all of this, make sure you have the proper Python runtime libraries installed (see the ANTLR Wiki), and be sure to use ANTLR version 3.1.3 with this runtime.

Generate the lexer and parser like this:

java -cp antlr-3.1.3.jar org.antlr.Tool Comment.g

and test the lexer and parser with the following Python script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *

# http://www.antlr.org/wiki/display/ANTLR3/Python+runtime
# http://www.antlr.org/download/antlr-3.1.3.jar

def print_level_order(tree, indent):
  # print() with a single argument behaves the same on Python 2 and 3
  print('{0}{1}'.format('   ' * indent, tree.text))
  for child in tree.getChildren():
    print_level_order(child, indent + 1)

# named `source` rather than `input` to avoid shadowing the builtin
source = '/*aaa1/*bbb/*ccc*/*/aaa2*/'
char_stream = antlr3.ANTLRStringStream(source)
lexer = CommentLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CommentParser(tokens)
tree = parser.parse().tree
print_level_order(tree, 0)

As you can see, from the source "/*aaa1/*bbb/*ccc*/*/aaa2*/", the following AST is created:

COMMENT
   aaa1
   COMMENT
      bbb
      COMMENT
         ccc
   aaa2

EDIT II

I might as well show how you can invoke the Comment parser from your CSS grammar. Here's a quick demo:

grammar CSS;

options {
  output=AST;
  language=Python;
}

tokens {
  CSS_FILE;
  RULE;
  BLOCK;
  DECLARATION;
}

@parser::header {
import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *
}

@parser::members {
  def parse_comment(self, text):
    lexer = CommentLexer(antlr3.ANTLRStringStream(text))
    parser = CommentParser(antlr3.CommonTokenStream(lexer))
    return parser.parse().tree 
}

parse
  :  atom+ EOF -> ^(CSS_FILE atom+)
  ;

atom
  :  rule
  |  Comment -> {self.parse_comment($Comment.text)}
  ;

rule
  :  Identifier declarationBlock -> ^(RULE Identifier declarationBlock)
  ;

declarationBlock
  :  '{' declaration+ '}' -> ^(BLOCK declaration+)
  ;

declaration
  :  a=Identifier ':' b=Identifier ';' -> ^(DECLARATION $a $b)
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
  ;

Comment
  :  '/*' (options {greedy=false;} : Comment | . )* '*/'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
  ;

If you parse the source:

h1 {  a: b;  c: d;}

/*aaa1/*bbb/*ccc*/*/aaa2*/

p {x  :  y;}

with the CSSParser, you'll get the following tree:

CSS_FILE
   RULE
      h1
      BLOCK
         DECLARATION
            a
            b
         DECLARATION
            c
            d
   COMMENT
      aaa1
      COMMENT
         bbb
         COMMENT
            ccc
      aaa2
   RULE
      p
      BLOCK
         DECLARATION
            x
            y

as you can see by running this test script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CSSLexer import *
from CSSParser import *

def print_level_order(tree, indent):
  # print() with a single argument behaves the same on Python 2 and 3
  print('{0}{1}'.format('   ' * indent, tree.text))
  for child in tree.getChildren():
    print_level_order(child, indent + 1)

# named `source` rather than `input` to avoid shadowing the builtin
source = 'h1 {  a: b;  c: d;}\n\n/*aaa1/*bbb/*ccc*/*/aaa2*/\n\np {x  :  y;}'
char_stream = antlr3.ANTLRStringStream(source)
lexer = CSSLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CSSParser(tokens)
tree = parser.parse().tree
print_level_order(tree, 0)
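The two-stage design above relies on the CSS lexer handing each nested comment to the second parser as one chunk of text. That first stage can be approximated in plain Python with a depth counter, as in the sketch below. The function name `split_top_level` and the chunk labels are hypothetical, and the sketch does not tokenize the CSS parts; it only separates them from the comments.

```python
def split_top_level(src):
    """Split src into ('css', text) and ('comment', text) chunks.

    Nested /* ... */ pairs are tracked with a depth counter so that a
    whole nested comment ends up in a single 'comment' chunk.
    """
    chunks = []
    pos = 0
    start = 0
    depth = 0
    while pos < len(src):
        if src.startswith('/*', pos):
            if depth == 0:
                # Entering a top-level comment: flush the CSS before it.
                if pos > start:
                    chunks.append(('css', src[start:pos]))
                start = pos
            depth += 1
            pos += 2
        elif depth > 0 and src.startswith('*/', pos):
            depth -= 1
            pos += 2
            if depth == 0:
                chunks.append(('comment', src[start:pos]))
                start = pos
        else:
            pos += 1
    if depth != 0:
        raise ValueError('unterminated comment')
    if pos > start:
        chunks.append(('css', src[start:pos]))
    return chunks


if __name__ == '__main__':
    for kind, text in split_top_level('h1 {a: b;}/*x/*y*/z*/p {c: d;}'):
        print(kind, repr(text))
```

Each 'comment' chunk still contains its delimiters, so it can be passed unchanged to a dedicated comment parser, which is the role parse_comment plays in the CSS grammar.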


ゝ杯具 2024-11-12 23:28:45

You should use the ! and ^ AST hints. To make /* not appear in your AST, put ! after it. To control which elements become roots of AST subtrees, append ^. It might look something like this:

NESTED_ML_COMMENT
:   COMMENT_START!
    (options {greedy=false;} : (NESTED_ML_COMMENT^ | . ) )* 
    COMMENT_END!
;

Here's a question specifically about these operators which, now that you know they exist, I hope will be useful:
What does ^ and ! stand for in ANTLR grammar
