Matching lexeme variants with Antlr3

Posted on 2024-09-25 04:32:58

I'm trying to match measurements in English input text, using Antlr 3.2 and Java 1.6. I've got lexer rules like the following:

fragment
MILLIMETRE
    :   'millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm'
    ;

MEASUREMENT
    :   MILLIMETRE | CENTIMETRE | ... ;

I'd like to be able to accept any combination of upper- and lowercase input and - more importantly - just return a single lexical token for all the variants of MILLIMETRE. But at the moment, my AST contains 'millimetre', 'millimeters', 'mm' etc. just as in the input text.

After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:

tokens {
    T_MILLIMETRE;
}

fragment
MILLIMETRE
    :   ('millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm') { $type = T_MILLIMETRE; }
    ;

However, when I do this, I get the following compiler errors in the Java code generated by Antlr:

cannot find symbol
_type = T_MILLIMETRE;

I tried the following instead:

MEASUREMENT
    :   MILLIMETRE  { $type = T_MILLIMETRE; }
    |   ...

but then MEASUREMENT is not matched anymore.

The more obvious solution with a rewrite rule:

MEASUREMENT
    :   MILLIMETRE  -> ^(T_MILLIMETRE MILLIMETRE)
    |   ...

causes an NPE:

java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555).

Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.

By creating a parser rule

measurement :  T_MILLIMETRE | ...

I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". Antlr still runs, but the AST still contains the input text rather than T_MILLIMETRE.

I'm obviously not yet seeing the world the way Antlr does. Can anyone give me any hints or advice please?

Steve

Comments (2)

以为你会在 2024-10-02 04:32:58

Here's a way to do that:

grammar Measurement;

options {
  output=AST;
}

tokens {
  ROOT;
  MM;
  CM;
}

parse
  :  measurement+ EOF -> ^(ROOT measurement+)
  ;

measurement
  :  Number MilliMeter -> ^(MM Number)
  |  Number CentiMeter -> ^(CM Number)
  ;

Number
  :  '0'..'9'+
  ;

MilliMeter
  :  'millimetre'
  |  'millimetres'
  |  'millimeter'
  |  'millimeters'
  |  'mm'
  ;

CentiMeter
  :  'centimetre'
  |  'centimetres'
  |  'centimeter'
  |  'centimeters'
  |  'cm'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

It can be tested with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("12 millimeters 3 mm 456 cm");
        MeasurementLexer lexer = new MeasurementLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MeasurementParser parser = new MeasurementParser(tokens);
        MeasurementParser.parse_return returnValue = parser.parse();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

which produces the following DOT file:

digraph {

    ordering=out;
    ranksep=.4;
    bgcolor="lightgrey"; node [shape=box, fixedsize=false, fontsize=12, fontname="Helvetica-bold", fontcolor="blue"
        width=.25, height=.25, color="black", fillcolor="white", style="filled, solid, bold"];
    edge [arrowsize=.5, color="black", style="bold"]

  n0 [label="ROOT"];
  n1 [label="MM"];
  n1 [label="MM"];
  n2 [label="12"];
  n3 [label="MM"];
  n3 [label="MM"];
  n4 [label="3"];
  n5 [label="CM"];
  n5 [label="CM"];
  n6 [label="456"];

  n0 -> n1 // "ROOT" -> "MM"
  n1 -> n2 // "MM" -> "12"
  n0 -> n3 // "ROOT" -> "MM"
  n3 -> n4 // "MM" -> "3"
  n0 -> n5 // "ROOT" -> "CM"
  n5 -> n6 // "CM" -> "456"

}

which corresponds to the tree:

[image: tree with ROOT at the top and the children MM -> 12, MM -> 3 and CM -> 456]

(image created by http://graph.gafol.net/)

EDIT

Note that the following:

measurement
  :  Number m=MilliMeter {System.out.println($m.getType() == MeasurementParser.MilliMeter);}
  |  Number CentiMeter
  ;

will always print true, regardless of whether the text of the MilliMeter token is mm, millimetre, millimetres, ...
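
The grammar above does not cover the case-insensitivity part of the question. Purely as an illustration of the idea (several spellings, any letter case, one token type, original text kept on the token), here is a hand-written Java sketch built on java.util.regex. It is not ANTLR-generated code and not how ANTLR works internally; the class UnitLexerSketch and its Type and Token members are names made up for this example, and it assumes Java 7 or later (named groups, diamond operator) rather than the Java 1.6 mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written lexer for illustration; not produced by ANTLR.
class UnitLexerSketch {
    enum Type { NUMBER, MM, CM }

    static final class Token {
        final Type type;   // canonical token type, shared by all spellings
        final String text; // lexeme exactly as it appeared in the input
        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Each named group plays the role of one lexer rule.
    // CASE_INSENSITIVE accepts any mix of upper- and lowercase ASCII letters.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NUMBER>[0-9]+)"
        + "|(?<MM>millimet(?:re|er)s?|mm)"
        + "|(?<CM>centimet(?:re|er)s?|cm)"
        + "|(?<SPACE>[ \\t\\r\\n]+)",
        Pattern.CASE_INSENSITIVE);

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input at index " + pos);
            }
            pos = m.end();
            if (m.group("SPACE") != null) {
                continue; // whitespace is skipped, like a hidden-channel token
            }
            for (Type t : Type.values()) {
                if (m.group(t.name()) != null) {
                    tokens.add(new Token(t, m.group()));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("12 Millimeters 3 MM 456 cm")) {
            System.out.println(t.type + " '" + t.text + "'");
        }
    }
}
```

The point of the sketch is the split between type and text: code that consumes the tokens can switch on the type alone, while the original spelling is still available if it is ever needed.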

鱼忆七猫命九 2024-10-02 04:32:58

Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:

grammar Measurement;

options {
  output=AST;
}

parse
  :  (m=MEASUREMENT {
       String contents = $m.text;
       boolean isMeasurementType = $m.getType() == MeasurementParser.MEASUREMENT;
       System.out.println("contents="+contents+", isMeasurementType="+isMeasurementType);
     })+ EOF
  ;

MEASUREMENT
  :  MILLIMETRE
  ;

fragment
MILLIMETRE
  :  'millimetre' 
  |  'millimetres'
  |  'millimeter' 
  |  'millimeters'
  |  'mm'
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

with input text:

"millimeters mm"

will print:

contents=millimeters, isMeasurementType=true
contents=mm, isMeasurementType=true

In other words: the type MILLIMETRE does not exist; the tokens are all of type MEASUREMENT.
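
To make the fragment idea concrete, here is a small plain-Java model of the grammar above. It is only a sketch under my own assumptions, not the code ANTLR generates: FragmentSketch, matchMillimetre and nextMeasurement are hypothetical names. The helper method stands in for the fragment rule (it consumes characters but never creates a token), and only the method standing in for MEASUREMENT creates one.

```java
// Hypothetical model of the grammar above; not code generated by ANTLR.
class FragmentSketch {
    // There is deliberately no MILLIMETRE entry here.
    enum Type { MEASUREMENT }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Longest spellings first, so "millimetres" is not cut short at "millimetre".
    private static final String[] MILLIMETRE = {
        "millimetres", "millimeters", "millimetre", "millimeter", "mm"
    };

    // Stands in for the fragment rule: returns how many characters match
    // at pos (0 if none) and creates no token of its own.
    static int matchMillimetre(String input, int pos) {
        for (String variant : MILLIMETRE) {
            if (input.startsWith(variant, pos)) {
                return variant.length();
            }
        }
        return 0;
    }

    // Stands in for the MEASUREMENT rule: the only place a token is created.
    // Returns null if no measurement starts at pos.
    static Token nextMeasurement(String input, int pos) {
        int length = matchMillimetre(input, pos);
        if (length == 0) {
            return null;
        }
        return new Token(Type.MEASUREMENT, input.substring(pos, pos + length));
    }

    public static void main(String[] args) {
        String input = "millimeters mm";
        Token first = nextMeasurement(input, 0);
        Token second = nextMeasurement(input, first.text.length() + 1);
        System.out.println(first.type + " '" + first.text + "'");
        System.out.println(second.type + " '" + second.text + "'");
    }
}
```

In this model, anything that needs to tell units apart later has only the token text to go on, which is why giving each unit its own non-fragment lexer rule (as in the other answer) is the more convenient layout.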
