Programmatically converting/parsing LaTeX code to plain text
I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.
For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.
I'm aware of the similar question on the TeX StackExchange site, but there weren't any really programmatic solutions to that: I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.
I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?
Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
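To make the brace-matching idea concrete, here is a minimal sketch of a recursive reducer that handles nested and multi-argument macros without a full parser. It is not anyone's actual code from this thread: the `MACROS` table, `read_group`, `wrap` and `to_plain` are hypothetical names, it assumes every macro argument is a brace group, and it ignores escaped braces such as `\{`.

```python
import re

MACRO_RE = re.compile(r"\\([A-Za-z]+)")


def wrap(s):
    """Parenthesise anything that is not a single word, e.g. 'a+b' -> '(a+b)'."""
    return s if re.fullmatch(r"\w+", s) else f"({s})"


# Hypothetical macro table: name -> (number of brace arguments, formatter).
MACROS = {
    "text": (1, lambda a: a),
    "mathrm": (1, lambda a: a),
    "frac": (2, lambda num, den: f"{wrap(num)}/{wrap(den)}"),
}


def read_group(s, i):
    """Return (content, next_index) for the brace group opening at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def to_plain(s):
    """Reduce the known macros in s, recursing into their arguments first."""
    out, i = [], 0
    while i < len(s):
        m = MACRO_RE.match(s, i)
        if m and m.group(1) in MACROS:
            nargs, fmt = MACROS[m.group(1)]
            j, args = m.end(), []
            for _ in range(nargs):
                while j < len(s) and s[j].isspace():
                    j += 1
                if j >= len(s) or s[j] != "{":
                    raise ValueError(f"missing argument for \\{m.group(1)}")
                content, j = read_group(s, j)
                args.append(to_plain(content))  # inner macros are reduced first
            out.append(fmt(*args))
            i = j
        elif s.startswith("\\,", i):  # thin space
            out.append(" ")
            i += 2
        elif s[i] == "~":  # non-breaking space
            out.append(" ")
            i += 1
        else:  # unknown macros and ordinary characters pass through unchanged
            out.append(s[i])
            i += 1
    return "".join(out)
```

With this table, `to_plain(r"\frac{\mathrm{d}x}{\text{a}+b}")` gives `dx/(a+b)`. Because arguments are reduced recursively before the formatter runs, the order in which macros are listed does not matter.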
Update update: for a long time I handled our use-case very hackily by concatenating all the strings I needed, and passing to pandoc! Yuck, but it worked well enough for our non-critical command-line cosmetics... until the cat'd string became too long! In the meantime pydetex (https://pydetex.readthedocs.io/en/latest/) has emerged and is another contender, cf. TeXSoup, if you can expand enough \defs via regex or simple replacement.
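For the pandoc route, one way to call the executable from Python is sketched below. It is not the code actually used above: `pandoc_plain` is a hypothetical helper and it assumes a `pandoc` binary on `PATH`. It feeds the string through standard input rather than the command line, which should sidestep argument-length limits if those were the cause of the "too long" failure.

```python
import shutil
import subprocess

# Read LaTeX from stdin, write plain text to stdout, do not re-wrap lines.
PANDOC_ARGS = ["pandoc", "--from=latex", "--to=plain", "--wrap=none"]


def pandoc_plain(latex: str) -> str:
    """Convert one LaTeX string to plain text with the pandoc executable."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        PANDOC_ARGS,
        input=latex,          # passed via stdin, not as an argument
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError if pandoc fails
    )
    return result.stdout.strip()
```

Calling this once per label costs one process start each time, so batching may still be worthwhile when there are many strings.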
Comments (8)
I understand this is an old post, but since it comes up often in LaTeX/Python parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line: here's a LaTeX parser in Python that supports search over and modification of the parse tree, https://github.com/alvinwan/texsoup. The README includes sample text and shows how to interact with it and navigate the parse tree via TexSoup.
Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tales (regarding \def), TexSoup doesn't handle definitions.
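A minimal sketch of that kind of interaction, in the spirit of the README but not copied from it: the `SAMPLE` text and the `find_commands` helper are made up for illustration, and it assumes the third-party TexSoup package is installed (`pip install texsoup`).

```python
try:
    from TexSoup import TexSoup
except ImportError:  # TexSoup is a third-party package, not in the stdlib
    TexSoup = None

SAMPLE = r"""
\begin{document}
\section{Hello \textit{world}.}
\begin{itemize}
  \item red lemon
  \item life
\end{itemize}
\end{document}
"""


def find_commands(latex, name):
    """Return the source text of every command called `name` in the parse tree."""
    if TexSoup is None:
        raise RuntimeError("TexSoup is not installed")
    soup = TexSoup(latex)
    return [str(node) for node in soup.find_all(name)]
```

For example, `find_commands(SAMPLE, "item")` collects the two `\item` entries, and attribute access such as `TexSoup(SAMPLE).section` returns the first `\section` node.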
A word of caution: it is much more difficult to write a complete parser for plain TeX than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, \def\foo #1.{{\bf #1}} will expand \foo goo. into {\bf goo}, i.e. "goo" in bold. Notice that the dot became a delimiter for the \foo macro! Therefore, if you have to deal with arbitrary TeX, without restrictions on which packages may be used, it is not recommended to rely on simple parsing. You need TeX rendering. catdvi is what I use, although it is not perfect.
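To illustrate why delimited parameters defeat brace-only parsing, here is a small sketch that mimics the expansion of that one macro with a regular expression. It is an illustration only, not how TeX works internally: `expand_delimited` is a hypothetical helper that handles a single `#1` argument and a literal delimiter.

```python
import re


def expand_delimited(source, name, delimiter, body):
    """Expand \\name <arg><delimiter> in source, substituting <arg> for '#1' in body."""
    pattern = re.compile(
        r"\\" + re.escape(name) + r"(?![A-Za-z])"  # the macro name, not a longer one
        + r"\s*(.*?)"                              # the argument: shortest text...
        + re.escape(delimiter),                    # ...up to the delimiter
        re.S,
    )
    # A function replacement avoids backslash-escape processing in the body.
    return pattern.sub(lambda m: body.replace("#1", m.group(1)), source)


# Mimics \def\foo #1.{{\bf #1}} applied to "\foo goo. Next."
expanded = expand_delimited(r"\foo goo. Next.", "foo", ".", r"{\bf #1}")
print(expanded)  # {\bf goo} Next.
```

The argument `goo` is never wrapped in braces, so a parser that only looks for `{...}` groups after a macro name would not find it. Each such \def in the input needs its own rule, which is why arbitrary TeX is hard to handle without running TeX.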
Necroing this old thread, but I found this nifty library called pylatexenc that seems to do almost exactly what the OP was after. The result is not perfect for equations, but it does a great job of stripping and converting all the TeX commands.
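A minimal sketch of the pylatexenc approach (the snippet originally posted with this answer is not reproduced here): it assumes the third-party pylatexenc package is installed (`pip install pylatexenc`), and `latex_to_plain` is a hypothetical wrapper around its `LatexNodes2Text` class.

```python
try:
    from pylatexenc.latex2text import LatexNodes2Text
except ImportError:  # pylatexenc is a third-party package, not in the stdlib
    LatexNodes2Text = None


def latex_to_plain(latex: str) -> str:
    """Convert a LaTeX string to plain text using pylatexenc's default rules."""
    if LatexNodes2Text is None:
        raise RuntimeError("pylatexenc is not installed")
    return LatexNodes2Text().latex_to_text(latex)
```

Text-mode markup such as `\textbf{...}` is stripped cleanly. How math-mode constructs like `\frac` come out depends on the library's settings and version, so it is worth checking the output on your own labels.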
Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/
Edit: oh, I see you tried detex already. Still, opendetex might work for you.
I would try [pandoc][1]. It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.
[1]: http://johnmacfarlane.net/pandoc/index.html
As you're considering using TeX itself for the rendering, I suspect that performance is not an issue. In that case you've got a couple of options: dvi2txt to fetch your text from a single DVI file (be prepared to generate one for each label), or even rendering the DVI into raster images, if that's OK for you. That's how hevea or latex2html treat formulas.
Building on the other post by Eduardo Leoni, I was looking at pandoc and I see that it comes with a standalone executable, but also on this page it promises a way to build it as a C-callable system library. Perhaps this is something that you can live with?
This is your mistake. You shouldn't have done that.
Use RST or some other -- better -- markup language.
Use Docutils to create LaTeX and HTML from the RST source.