Can I add Antlr tokens at runtime?

Posted on 2024-11-09 00:52:46

I have a situation where my language contains some words that aren't known at build time but will be known at run time, so I constantly have to rebuild and redeploy the program to take new words into account. I was wondering whether Antlr can generate some of its tokens from a config file?

For example, in a simplified case, suppose I have the rule

rule : WORDS+;

WORDS : 'abc';

If my language comes across 'bcd' at runtime, I would like to be able to modify a config file to define bcd as a word, rather than having to rebuild and then redeploy.

Answer from 最冷一天, posted 2024-11-16 00:52:46

You could add some sort of collection to your lexer class. This collection will hold all runtime words. Then, inside the lexer rule that could match these runtime words, you add some custom code that changes the type of the token if its text is present in the collection.

Demo

Let's say you want to parse the input:

"foo bar baz"

and at runtime, the words "foo" and "baz" should become special runtime words. The following grammar (ANTLR v3 syntax) shows how to solve this:

grammar RuntimeWords;

tokens {
  RUNTIME_WORD;
}

@lexer::members {

  private java.util.Set<String> runtimeWords;

  public RuntimeWordsLexer(CharStream input, java.util.Set<String> words) {
    super(input);
    runtimeWords = words;
  }
}

parse
  :  (w=. {System.out.printf("\%-15s :: \%s \n", tokenNames[$w.type], $w.text);})+ EOF
  ;

Word
  :  ('a'..'z' | 'A'..'Z')+
     {
       if(runtimeWords.contains(getText())) {
         $type = RUNTIME_WORD;
       }
     }
  ;

Space
  :  ' ' {skip();}
  ;

And a little test class:

import org.antlr.runtime.*;
import java.util.*;

public class Main {
  public static void main(String[] args) throws Exception {
    Set<String> words = new HashSet<String>(Arrays.asList("foo", "baz"));
    ANTLRStringStream in = new ANTLRStringStream("foo bar baz");
    RuntimeWordsLexer lexer = new RuntimeWordsLexer(in, words);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RuntimeWordsParser parser = new RuntimeWordsParser(tokens);        
    parser.parse();
  }
}

which will produce the following output:

RUNTIME_WORD    :: foo 
Word            :: bar 
RUNTIME_WORD    :: baz
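
The test class above hard-codes the runtime words. Since the question asks about a config file, here is a minimal sketch of how the set might be filled from one. The class name RuntimeWordLoader and the file format (one word per line, with blank lines and lines starting with '#' ignored) are assumptions made for illustration, not something ANTLR prescribes:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of ANTLR: collects runtime words from a text file.
final class RuntimeWordLoader {

  // Assumed format: one word per line; blank lines and '#' comment lines are skipped.
  static Set<String> parse(List<String> lines) {
    Set<String> words = new LinkedHashSet<>();
    for (String line : lines) {
      String word = line.trim();
      if (!word.isEmpty() && !word.startsWith("#")) {
        words.add(word);
      }
    }
    return words;
  }

  // Reads the whole file as UTF-8 and delegates to parse().
  static Set<String> load(Path configFile) throws IOException {
    return parse(Files.readAllLines(configFile, StandardCharsets.UTF_8));
  }
}
```

The resulting set could then be handed to the lexer constructor in place of the hard-coded one, for example new RuntimeWordsLexer(in, RuntimeWordLoader.load(Paths.get("words.txt"))), where words.txt is a made-up file name. Only the config file changes when a new word appears; the grammar and the generated lexer stay as they are.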

Demo II

Here's another demo that is more tailored to your problem (I skimmed your question too quickly at first, but I'll leave my first demo in place because it might come in handy for someone). There aren't many comments in it, but my guess is that you won't have problems grasping what happens (if you do, don't hesitate to ask for clarification!).

grammar RuntimeWords;

@lexer::members {

  private java.util.Set<String> runtimeWords;

  public RuntimeWordsLexer(CharStream input, java.util.Set<String> words) {
    super(input);
    runtimeWords = words;
  }

  private boolean runtimeWordAhead() {
    for(String word : runtimeWords) {
      if(ahead(word)) {
        return true;
      }
    }
    return false;
  }

  private boolean ahead(String word) {
    for(int i = 0; i < word.length(); i++) {
      if(input.LA(i+1) != word.charAt(i)) {
        return false;
      }
    } 
    return true; 
  }
}

parse
  :  (w=. {System.out.printf("\%-15s :: \%s \n", tokenNames[$w.type], $w.text);})+ EOF
  ;

Word
  :  {runtimeWordAhead()}?=> ('a'..'z' | 'A'..'Z')+
  |  'abc'
  ;

Space
  :  ' ' {skip();}
  ;

and the class:

import org.antlr.runtime.*;
import java.util.*;

public class Main {
  public static void main(String[] args) throws Exception {
    Set<String> words = new HashSet<String>(Arrays.asList("BBB", "CDEFG"));
    ANTLRStringStream in = new ANTLRStringStream("BBB abc CDEFG");
    RuntimeWordsLexer lexer = new RuntimeWordsLexer(in, words);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RuntimeWordsParser parser = new RuntimeWordsParser(tokens);        
    parser.parse();
  }
}

will produce:

Word            :: BBB 
Word            :: abc 
Word            :: CDEFG

Be careful if one of your runtime words starts with another. For example, if your runtime words contain "stack" and "stacker", you want the longer word to be checked first! Sorting the words by length, longest first, takes care of that; note that a HashSet has no defined iteration order, so the sorted words need to live in an ordered collection such as a List.
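
A minimal sketch of the longest-first ordering, in plain Java without the ANTLR runtime. The class and method names are hypothetical; the tie-break on natural order is an extra assumption added only to make the result deterministic:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders runtime words so that longer words are tried first.
final class RuntimeWordOrder {

  static List<String> longestFirst(Collection<String> words) {
    // Copy into a List, because a HashSet does not keep any particular order.
    List<String> sorted = new ArrayList<>(words);
    // Longest first; words of equal length are ordered alphabetically.
    sorted.sort(Comparator.comparingInt(String::length).reversed()
        .thenComparing(Comparator.<String>naturalOrder()));
    return sorted;
  }
}
```

With this, the lexer would iterate over the returned list instead of the raw set, so "stacker" is always compared against the input before "stack".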

One final word of caution: if only "stack" is in your runtime word list and the lexer encounters "stacker", then you probably don't want to create a "stack"-token and leave "er" dangling. In that case, you'll want to check if the character after the last char in the word is not a letter:

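private boolean ahead(String word) {
  for(int i = 0; i < word.length(); i++) {
    if(input.LA(i+1) != word.charAt(i)) {
      return false;
    }
  }
  // LA(...) is 1-based, so the character right after the word is at word.length() + 1
  int charAfterWord = input.LA(word.length() + 1);
  // EOF (-1) is not a letter, so a word at the very end of the input is accepted
  return !((charAfterWord >= 'a' && charAfterWord <= 'z')
        || (charAfterWord >= 'A' && charAfterWord <= 'Z'));
}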
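
The same boundary check can be tried out without the ANTLR runtime. The following is a rough sketch on a plain String rather than a CharStream; WordBoundary and wholeWordAt are made-up names, and "letter" is assumed to mean ASCII a-z and A-Z, as in the Word rule:

```java
// Hypothetical stand-alone version of the "no letter after the word" check.
final class WordBoundary {

  // Mirrors the character ranges used by the Word lexer rule.
  static boolean isAsciiLetter(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }

  // True if word occurs at pos and is followed by a non-letter or the end of input.
  static boolean wholeWordAt(String input, int pos, String word) {
    if (!input.startsWith(word, pos)) {
      return false;
    }
    int end = pos + word.length();
    return end >= input.length() || !isAsciiLetter(input.charAt(end));
  }
}
```

Here the end of the string plays the role that EOF plays in the lexer: a word that runs up to the end of the input still counts as a complete word.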
