Parsing a LaTeX-like language in Java

Posted on 2024-09-14 09:24:53, 366 characters, 8 views, 0 comments

I'm trying to write a parser in Java for a simple language similar to LaTeX, i.e. it contains lots of unstructured text with a few \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.

I've tried to generate a parser for that with JavaCC, but it looks as if compiler-compilers like JavaCC are only suitable for highly structured code (typical of general-purpose programming languages), not for messy LaTeX-like markup. So far, it seems I have to go low level and write my own finite state machine.

So my question is: what's the easiest way to parse input that is mostly unstructured, with only a few LaTeX-like commands in between?

EDIT: Going low level with a finite state machine is difficult because the LaTeX commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}

Comments (1)

机场等船 2024-09-21 09:24:53

You can define a grammar to accept the LaTeX input, using just characters as tokens in the worst case. JavaCC should be just fine for this purpose.

The good thing about a grammar and a parser generator is that they can parse things that FSAs have trouble with, especially nested structures.

A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):

 Latex = item* ;
 item = command | rawtext ;
 command = commandname arguments ;
 commandname = '\' letter ( letter | digit )* ;  -- might pick this up as lexeme
 letter = 'a' | 'b' | ... | 'z' ;
 digit = '0' | ... | '9' ;
 arguments = ( '{' item* '}' )* ;  -- zero or more brace groups, as in \cmd{some}{parameters}
 rawtext = ( letter | digit | whitespace | punctuationminusbackslash )+ ; -- might pick this up as lexeme
 whitespace = ' ' | '\t' | '\n' | '\:0D' ;
 punctuationminusbackslash = '!' | ... | '^' ;
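
If a parser generator feels too heavy, a grammar this small can also be written by hand as a recursive-descent parser, which copes with nesting because each brace group simply recurses into the item rule. The following is only a sketch that loosely follows the grammar above. The class name MiniLatexParser and its Node, Text and Command types are made up for illustration. It also makes a few assumptions that the grammar does not state: a command may take any number of brace arguments, a backslash followed by a non-letter (such as \\ or \{) is treated as an escaped literal character, command names accept any Unicode letter or digit, and bracketed optional arguments like [with] are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written parser for a LaTeX-like markup (illustrative sketch). */
final class MiniLatexParser {

    /** Base type for parsed items. */
    static abstract class Node {}

    /** A run of raw text, with escape sequences already resolved. */
    static final class Text extends Node {
        final String value;
        Text(String value) { this.value = value; }
    }

    /** A command such as \cmd{arg1}{arg2}; each argument is a list of items. */
    static final class Command extends Node {
        final String name;
        final List<List<Node>> args = new ArrayList<>();
        Command(String name) { this.name = name; }
    }

    private final String src;
    private int pos = 0;

    private MiniLatexParser(String src) { this.src = src; }

    /** Parses the whole input or throws IllegalArgumentException. */
    static List<Node> parse(String src) {
        MiniLatexParser p = new MiniLatexParser(src);
        List<Node> items = p.items();
        if (p.pos < src.length()) { // items() only stops early on a '}'
            throw new IllegalArgumentException("Unmatched '}' at " + p.pos);
        }
        return items;
    }

    private boolean atCommandStart() {
        return src.charAt(pos) == '\\' && pos + 1 < src.length()
                && Character.isLetter(src.charAt(pos + 1));
    }

    // item* : stops at end of input or at a '}' that closes an argument
    private List<Node> items() {
        List<Node> out = new ArrayList<>();
        while (pos < src.length() && src.charAt(pos) != '}') {
            out.add(atCommandStart() ? command() : rawText());
        }
        return out;
    }

    // commandname followed by zero or more '{' item* '}' groups
    private Command command() {
        pos++; // skip the backslash
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        Command cmd = new Command(src.substring(start, pos));
        while (pos < src.length() && src.charAt(pos) == '{') {
            pos++; // skip '{'
            cmd.args.add(items()); // recursion handles nested commands
            if (pos >= src.length()) {
                throw new IllegalArgumentException("Missing '}' for \\" + cmd.name);
            }
            pos++; // skip '}'
        }
        return cmd;
    }

    // raw text up to the next command, '}' or end of input
    private Text rawText() {
        StringBuilder sb = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '}') {
            char c = src.charAt(pos);
            if (c == '{') {
                throw new IllegalArgumentException("Unescaped '{' at " + pos);
            }
            if (c == '\\') {
                if (atCommandStart()) break;
                if (pos + 1 >= src.length()) {
                    throw new IllegalArgumentException("Dangling backslash at " + pos);
                }
                sb.append(src.charAt(pos + 1)); // escaped character, e.g. \\ or \{
                pos += 2;
            } else {
                sb.append(c);
                pos++;
            }
        }
        return new Text(sb.toString());
    }
}
```

Calling MiniLatexParser.parse on the input a \cmd1{\cmd2{x}} b\\c would yield three top-level items: a text node, the command cmd1 with one argument that itself contains the command cmd2, and a trailing text node in which \\ has become a single backslash. The same structure maps fairly directly onto JavaCC productions if a generated parser is preferred later.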