How to extract the important text content from a LaTeX document
I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.
I am supposed to omit:
- images,
- tables and other figures,
- equations,
- captions and footnotes.
It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.
Is there any straightforward way to do this?
I don't really fancy copying it manually page-by-page.
You could try the comment package (or one of a dozen alternatives) to turn equation, figure, table, etc. into comment environments, and use \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headings etc., so running pdftotext on the result should come close to what you want.
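A preamble for this could look roughly like the sketch below. It is untested against any particular thesis class; which environments to exclude is an assumption based on the list in the question, and align is only an example of a further environment a document might use.

```latex
% Sketch only: add to the preamble of a copy of the thesis, after the other packages.
\usepackage{comment}

% Treat these environments as comments, so their contents (captions included) are skipped.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}
% Other environments used in the document would need their own line, for example:
% \excludecomment{align}

% Swallow the argument of every footnote.
\renewcommand\footnote[1]{}

% Suppress running heads and page numbers.
\pagestyle{empty}
```

After compiling, something like `pdftotext -enc UTF-8 thesis.pdf thesis.txt` should give the text file (thesis is a placeholder name). Note that the comment package expects \begin{figure} and \end{figure} to sit on lines of their own, and inline math written with $...$ is not affected by any of this.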
You could use a document converter like pandoc, or convert the output PDF to plain text with something like Calibre.
Usually you want some LaTeX processing done on the text first. If a paragraph contains macros, simply filtering that paragraph out of the source will not give the intended text.
Trying to extract content directly from the *.tex file therefore usually leaves much to be desired. It is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.
detex has already been mentioned, but there is another project, opendetex, that aims to improve on it. Give it a look!
Yes: untex, a simple C program. You can also look at detex.