
Parsing a LaTeX-like language in Java

Posted 2024-09-14 09:24:53

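I'm trying to write a parser in Java for a simple LaTeX-like language, i.e. it contains lots of unstructured text with a few \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.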

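I tried generating a parser with JavaCC, but it looks as if compiler-compilers like JavaCC are only suited to highly structured code (typical of general-purpose programming languages), not to messy LaTeX-like markup. So far, it seems I have to go low level and write my own finite state machine.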

So my question is: what is the easiest way to parse input that is mostly unstructured, with only a few LaTeX-like commands in between?

EDIT: Going low level with a finite state machine is difficult, since LaTeX commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}


机场等船 2024-09-21 09:24:53

You can define a grammar to accept the LaTeX input, using individual characters as tokens in the worst case. JavaCC should be just fine for this purpose.

The good thing about a grammar and a parser generator is that together they can parse things that finite-state automata (FSAs) have trouble with, especially nested structures.
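
To make the nesting point concrete, here is a minimal Java sketch (the class and method names are made up for illustration, not part of JavaCC). Tracking how deeply braces are nested needs a counter that can grow without bound, which is exactly what a machine with a fixed, finite set of states lacks. The sketch also assumes that a backslash escapes the character after it.

```java
final class BraceDepth {
    // Returns the maximum nesting depth of '{' ... '}' in the input,
    // or -1 if the braces are unbalanced.
    // Assumption: a backslash makes the following character literal.
    static int maxDepth(String s) {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;            // skip the character after the backslash
                continue;
            }
            if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    return -1;  // closing brace without a matching opener
                }
            }
        }
        return depth == 0 ? max : -1;
    }
}
```

A grammar-based parser gets the same bookkeeping for free from its recursion, and it builds a tree instead of just counting.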

A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):

 Latex = item* ;
 item = command | rawtext ;
 command = commandname arguments ;
 commandname = '\' letter ( letter | digit )* ;  -- might pick this up as lexeme
 letter = 'a' | 'b' | ... | 'z' ;
 digit = '0' | ... | '9' ;
 arguments = epsilon | '{' item* '}' ;
 rawtext = ( letter | digit | whitespace | punctuationminusbackslash )+ ; -- might pick this up as lexeme
 whitespace = ' ' | '\t' | '\n' | '\r' ;
 punctuationminusbackslash = '!' | ... | '^' ;
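
If a parser generator feels like too much machinery, the same grammar can also be hand-coded as a small recursive-descent parser. The following is only a sketch under stated assumptions, not JavaCC output: `MiniLatexParser` and `Node` are hypothetical names, the first `command` rule is read as "a command name followed by its arguments", any number of `{...}` arguments is accepted (an extension of the grammar above), `[...]` optional arguments are not handled, and a backslash followed by a non-letter (such as `\\` or `\{`) is treated as an escape for that character.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniLatexParser {
    /** Hypothetical tree node: raw text (name == null) or a command with arguments. */
    static final class Node {
        final String name;                       // command name, or null for raw text
        final String text;                       // raw text, or null for a command
        final List<List<Node>> args = new ArrayList<>();
        Node(String name, String text) { this.name = name; this.text = text; }
        @Override public String toString() {
            return name == null ? "\"" + text + "\"" : "\\" + name + args;
        }
    }

    private final String src;
    private int pos = 0;

    private MiniLatexParser(String src) { this.src = src; }

    /** Latex = item* ; the whole input must be consumed. */
    static List<Node> parse(String input) {
        MiniLatexParser p = new MiniLatexParser(input);
        List<Node> items = p.items();
        if (p.pos < p.src.length()) {
            throw new IllegalArgumentException("Unexpected '}' at position " + p.pos);
        }
        return items;
    }

    // item* : stops at end of input or at an unescaped '}'
    private List<Node> items() {
        List<Node> out = new ArrayList<>();
        while (pos < src.length() && src.charAt(pos) != '}') {
            out.add(startsCommand() ? command() : rawText());
        }
        return out;
    }

    // A command starts with '\' followed by a letter; other "\x" pairs are escapes.
    private boolean startsCommand() {
        return src.charAt(pos) == '\\' && pos + 1 < src.length()
                && Character.isLetter(src.charAt(pos + 1));
    }

    // command = '\' letter (letter | digit)* ('{' item* '}')*
    private Node command() {
        int start = ++pos;                       // skip the backslash
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        Node cmd = new Node(src.substring(start, pos), null);
        while (pos < src.length() && src.charAt(pos) == '{') {
            pos++;                               // consume '{'
            cmd.args.add(items());               // recursion handles nesting
            if (pos >= src.length()) {
                throw new IllegalArgumentException("Missing '}' for \\" + cmd.name);
            }
            pos++;                               // consume '}'
        }
        return cmd;
    }

    // rawtext: everything up to the next command or unescaped '}', escapes resolved
    private Node rawText() {
        StringBuilder sb = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '}' && !startsCommand()) {
            char c = src.charAt(pos++);
            if (c == '\\' && pos < src.length()) c = src.charAt(pos++); // take next char literally
            sb.append(c);
        }
        return new Node(null, sb.toString());
    }
}
```

Calling `MiniLatexParser.parse("Hi \\cmd1{\\cmd2{x}} \\\\ end")` yields three top-level nodes: raw text, the `cmd1` command whose single argument contains `cmd2`, and trailing raw text in which `\\` has become one literal backslash. An unescaped `{` or `}` outside a command argument is not supported by this sketch.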
