Programmatically converting/parsing LaTeX code to plain text

Posted on 2024-10-14 03:31:49

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.

For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.

I'm aware of the similar question on the TeX StackExchange site, but there weren't any truly programmatic solutions there. I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.

I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?

Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
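To make the parser-versus-regex point concrete, here is a minimal sketch of a brace-matching converter using only the standard library. The macro table, the symbol table and the function names are hypothetical choices for illustration, not part of any existing package. It reduces \text and \mathrm to their argument, turns \frac into a division with parentheses where needed, and handles nesting by recursing into each argument. It ignores escaped braces such as \{ and silently drops macros it does not know, so treat it as a starting point only.

```python
import re


def _paren(x):
    # Parenthesise anything that is not a single alphanumeric token.
    return x if re.fullmatch(r"\w+", x) else f"({x})"


# Hypothetical macro table: name -> (argument count, formatter).
MACROS = {
    "text": (1, lambda a: a),
    "mathrm": (1, lambda a: a),
    "frac": (2, lambda a, b: f"{_paren(a)}/{_paren(b)}"),
}
# Argument-less control sequences that map to plain text (thin space etc.).
SYMBOLS = {",": " ", ";": " ", " ": " "}
# A control word (letters) or a control symbol (one other character).
CONTROL_SEQ = re.compile(r"\\([A-Za-z]+|.)", re.DOTALL)


def read_group(s, i):
    """Return (content, next_index) for the brace group opening at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def read_arg(s, i):
    """Read one macro argument: a brace group or a single character."""
    while i < len(s) and s[i].isspace():
        i += 1
    if i >= len(s):
        raise ValueError("missing macro argument")
    if s[i] == "{":
        return read_group(s, i)
    return s[i], i + 1


def to_plain(s):
    out, i = [], 0
    while i < len(s):
        ch = s[i]
        m = CONTROL_SEQ.match(s, i) if ch == "\\" else None
        if m:
            name, i = m.group(1), m.end()
            if name in MACROS:
                nargs, fmt = MACROS[name]
                args = []
                for _ in range(nargs):
                    arg, i = read_arg(s, i)
                    args.append(to_plain(arg))  # recurse for nested macros
                out.append(fmt(*args))
            else:
                out.append(SYMBOLS.get(name, ""))  # unknown macros are dropped
        elif ch == "{":
            inner, i = read_group(s, i)
            out.append(to_plain(inner))
        elif ch == "~":
            out.append(" ")  # non-breaking space
            i += 1
        elif ch == "$":
            i += 1  # drop math-mode delimiters
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(to_plain(r"$10\,\mathrm{GeV}$"))       # 10 GeV
print(to_plain(r"\frac{1}{\frac{x}{y}}"))    # 1/(x/y)
```

Because each argument is converted before its macro is formatted, inner macros like \text are reduced first and the enclosing \frac is handled last, which is the ordering described above.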

Update update: for a long time I handled our use-case very hackily by concatenating all the strings I needed, and passing to pandoc! Yuck, but it worked well enough for our non-critical command-line cosmetics... until the cat'd string became too long! In the meantime pydetex (https://pydetex.readthedocs.io/en/latest/) has emerged and is another contender, cf. TeXSoup, if you can expand enough \defs via regex or simple replacement.
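For the "call an executable from Python and catch the output" route, a minimal sketch could look like the following. It assumes a pandoc executable is on PATH, and the helper names run_filter and latex_to_plain are made up for this example. The text is fed through stdin rather than as a command-line argument, which may sidestep the length problem if that came from argument-size limits (an assumption; the cause is not stated above).

```python
import subprocess


def run_filter(cmd, text, timeout=30):
    """Feed `text` to `cmd` on stdin and return its stdout as a string."""
    result = subprocess.run(
        cmd,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
        timeout=timeout,
    )
    return result.stdout


def latex_to_plain(text):
    # Assumes `pandoc` is installed; converts LaTeX input to plain text.
    return run_filter(["pandoc", "--from=latex", "--to=plain"], text).strip()
```

Calling latex_to_plain once per label starts a new process each time, so for many short strings the per-call overhead may matter more than the conversion itself.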

童话里做英雄 2024-10-21 03:31:50

I understand this is an old post, but since it comes up often in latex-python-parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line: here's a LaTeX parser in Python that supports search over and modification of the parse tree, https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.

from TexSoup import TexSoup
soup = TexSoup(r"""
\begin{document}

\section{Hello \textit{world}.}

\subsection{Watermelon}

(n.) A sacred fruit. Also known as:

\begin{itemize}
\item red lemon
\item life
\end{itemize}

Here is the prevalence of each synonym.

\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}

\end{document}
""")

Here's how to navigate the parse tree.

>>> soup.section  # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]

Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tales (regarding def), TexSoup doesn't handle definitions.

荆棘i 2024-10-21 03:31:50

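A word of caution: writing a complete parser for plain TeX is much harder than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, \def\foo #1.{{\bf #1}} expands \foo goo. into {\bf goo}, i.e. a bold "goo". Note that the dot has become a delimiter for the \foo macro! So if you have to deal with arbitrary TeX, with no restriction on which packages may be used, relying on simple parsing is not advisable. You need TeX to do the rendering. catdvi is what I use, although it is not perfect.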
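The delimited-parameter case can be illustrated with a small sketch. The function expand_delimited below is a hypothetical helper written for this example, not part of any library. It handles exactly one pattern, a single argument terminated by a literal delimiter, and ignores everything else TeX would do (brace-protected delimiters, nested definitions, category codes), which is the point: every such \def needs its own rule.

```python
import re


def expand_delimited(source, name, delimiter, body):
    r"""Expand \name<arg><delimiter>, where '#1' in `body` marks the argument.

    Mirrors a definition like \def\foo #1.{{\bf #1}} for one delimiter only.
    """
    pattern = re.compile(
        r"\\" + re.escape(name) + r"(?![A-Za-z])"  # the control word itself
        + r"\s*(.*?)" + re.escape(delimiter),      # shortest text up to the delimiter
        re.DOTALL,
    )
    # A function replacement avoids backslash-escape processing in `body`.
    return pattern.sub(lambda m: body.replace("#1", m.group(1)), source)


print(expand_delimited(r"\foo goo. rest", "foo", ".", r"{\bf #1}"))
# {\bf goo} rest
```

A brace-only argument reader would instead take just the "g" as the argument of \foo, so the two approaches disagree on the same input.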

恍梦境° 2024-10-21 03:31:50

Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:

from pylatexenc.latex2text import LatexNodes2Text


LatexNodes2Text().latex_to_text(r"""\
\section{Euler}
\emph{This} bit is \textbf{very} clever:
\begin{equation}
    \mathrm{e}^{i \pi} + 1 = 0  % wow!!
\end{equation}
where
\[
\mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
\]
""")

which produces


§ EULER

This bit is very clever:

    e^i π + 1 = 0

where

    e = lim_n →∞(1 + 1/n)^n

As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting all the tex commands.

依靠 2024-10-21 03:31:50

Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/

Edit: oh, I see you've already tried detex. Still, opendetex might work for you.

娇俏 2024-10-21 03:31:50

I would try pandoc [1]. It is written in Haskell, but it is a very good LaTeX converter.

[1]: http://johnmacfarlane.net/pandoc/index.html

欢烬 2024-10-21 03:31:50

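Since you're considering using TeX itself for the rendering, I suspect performance is not a problem. In that case you have a couple of options: dvi2txt to fetch the text from a single dvi file (be prepared to generate one for each label), or even rendering the dvi into raster images, if that is acceptable to you. That is how hevea or latex2html treat formulas.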

荒人说梦 2024-10-21 03:31:50

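Building on the other post by Eduardo Leoni: I was looking at pandoc and I see that it comes with a standalone executable, but on this page it also promises a way to build a C-callable system library. Perhaps that is something you can live with?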

夜血缘 2024-10-21 03:31:50

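"LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks"

This is your mistake. You shouldn't do that.

Use RST or some other, better, markup language.

Use Docutils to create LaTeX and HTML from the RST source.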

About the author: 夏了南城
