Lexing space-separated words in ANTLR3 where some words are keywords
I am working on a project that involves transforming part-of-speech-tagged text into an ANTLR3 AST, with phrases as the nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+, where neither the tag nor the token contains whitespace.
Is the following a good way of lexing this?
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work, but it results in a ~9000-line Java lexer and takes a large amount of memory to build (~2 GB), so I was wondering whether this is the optimal way of solving this problem.
Comments (1)
Could you combine the TAG, the space, and the TOKEN into a single AST subtree? Then you could pass both the TAG and the TOKEN into your source code for handling. If the Java code used to handle the resulting tree is very similar across the various TAGs, you could perhaps simplify the ANTLR grammar, with the trade-off of a bit more complication in your Java code.
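To make the Java side of that trade-off concrete, here is a minimal sketch in plain Java with no ANTLR dependency. It treats every tag as ordinary data instead of as a keyword, so no per-tag rule is needed. The names TaggedPairs, Pair and parse are hypothetical and only for illustration, and it assumes the input is strictly (tag token)+ separated by spaces, as in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedPairs {
    /** One (tag, token) pair, e.g. tag "NN" with token "dog". Hypothetical helper type. */
    static final class Pair {
        final String tag;
        final String token;

        Pair(String tag, String token) {
            this.tag = tag;
            this.token = token;
        }
    }

    /** Splits the input on runs of spaces and groups the words two at a time. */
    static List<Pair> parse(String input) {
        String[] words = input.trim().split(" +");
        // Assumption: the input is exactly (tag token)+, so the word count must be even.
        if (words.length % 2 != 0) {
            throw new IllegalArgumentException("expected (tag token)+ but got an odd number of words");
        }
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i += 2) {
            // The tag is kept as a plain string; any per-tag handling happens later in Java.
            pairs.add(new Pair(words[i], words[i + 1]));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .";
        for (Pair p : parse(input)) {
            System.out.println("(" + p.tag + " " + p.token + ")");
        }
    }
}
```

Because the tags are data here, adding a new tag means no grammar change at all; the cost is that the code consuming the pairs has to decide what each tag means, for example with a switch or a lookup table keyed by the tag string.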