Parsing a LaTeX-like language in Java
I'm trying to write a parser in Java for a simple language similar to LaTeX, i.e. it contains lots of unstructured text with a few \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.
I've tried to generate a parser for it with JavaCC, but it looks as if compiler-compilers like JavaCC are only suitable for highly structured code (typical of general-purpose programming languages), not for messy LaTeX-like markup. So far, it seems I have to go low level and write my own finite state machine.
So my question is: what's the easiest way to parse input that is mostly unstructured, with only a few LaTeX-like commands in between?
EDIT: Going low level with a finite state machine is difficult because the commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}
Comments (1)
You can define a grammar to accept the LaTeX-like input, using just single characters as tokens in the worst case. JavaCC should be just fine for this purpose.
The good thing about a grammar and a parser generator is that together they can parse things that FSAs have trouble with, especially nested structures.
A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):
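The grammar itself is not included in this copy of the answer. As a rough illustration of the same idea (not the answerer's grammar, and not JavaCC input), here is a minimal hand-written recursive-descent sketch in Java. The EBNF in the comment, the class name `LatexLikeParser`, the `Node` type, and the rule that a backslash followed by a non-letter is an escape are all assumptions made for this example. Nesting is handled by recursion, which is the part a plain finite state machine cannot do.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed grammar (hypothetical, for illustration only):
//   document ::= ( text | escape | command )*
//   command  ::= '\' name ( '[' document ']' )? ( '{' document '}' )*
//   escape   ::= '\' any-non-letter-char        e.g. \\  \{  \}
//   name     ::= letter ( letter | digit )*
final class LatexLikeParser {
    /** Either a run of plain text or a command with its arguments. */
    static final class Node {
        final boolean isCommand;
        final String value;                         // text content, or command name
        List<Node> optional;                        // contents of [...], null if absent
        final List<List<Node>> args = new ArrayList<>(); // one entry per {...}
        Node(boolean isCommand, String value) {
            this.isCommand = isCommand;
            this.value = value;
        }
    }

    private final String src;
    private int pos = 0;

    private LatexLikeParser(String src) { this.src = src; }

    static List<Node> parse(String input) {
        return new LatexLikeParser(input).parseUntil(-1); // -1: stop only at end of input
    }

    // Parses nodes until the stop character (not consumed) or end of input.
    private List<Node> parseUntil(int stop) {
        List<Node> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != stop) {
            char c = src.charAt(pos);
            if (c != '\\') {                        // ordinary character
                text.append(c);
                pos++;
            } else if (pos + 1 >= src.length()) {
                throw new IllegalArgumentException("Dangling backslash at " + pos);
            } else if (!Character.isLetter(src.charAt(pos + 1))) {
                text.append(src.charAt(pos + 1));   // escape: keep the escaped char
                pos += 2;
            } else {
                flush(text, out);
                out.add(parseCommand());
            }
        }
        flush(text, out);
        return out;
    }

    private Node parseCommand() {
        pos++;                                      // skip the backslash
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        Node cmd = new Node(true, src.substring(start, pos));
        if (peek('[')) cmd.optional = parseGroup(']');
        while (peek('{')) cmd.args.add(parseGroup('}')); // recursion handles nesting
        return cmd;
    }

    private List<Node> parseGroup(char close) {
        pos++;                                      // skip the opening delimiter
        List<Node> body = parseUntil(close);
        if (pos >= src.length()) throw new IllegalArgumentException("Missing '" + close + "'");
        pos++;                                      // skip the closing delimiter
        return body;
    }

    private boolean peek(char c) { return pos < src.length() && src.charAt(pos) == c; }

    private static void flush(StringBuilder text, List<Node> out) {
        if (text.length() > 0) {
            out.add(new Node(false, text.toString()));
            text.setLength(0);
        }
    }
}
```

Calling `LatexLikeParser.parse("Hello \\cmd1[opt]{\\cmd2{x}} world")` should yield a text node, a `cmd1` command node whose single argument contains a nested `cmd2` node, and a trailing text node. The sketch does not balance unescaped braces inside plain text, and it treats a `[` directly after a command name as the start of an optional argument; a real grammar for this language would need to decide both points explicitly.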