Extracting text data from PDF files
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
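For what it's worth, here is a minimal sketch of driving pdftotext from R, assuming the tool is installed and on the PATH. The helper names txt_path_for and pdf_to_lines are made up for illustration and are not part of any package.

```r
# Sketch: convert a PDF with the external pdftotext tool, then read the text.
# Assumes pdftotext (poppler-utils or Xpdf) is installed and on the PATH.

# Hypothetical helper: foo.pdf -> foo.txt, mirroring pdftotext's default naming.
txt_path_for <- function(pdf_path) {
  paste0(tools::file_path_sans_ext(pdf_path), ".txt")
}

# Hypothetical helper: run pdftotext and return the text as a character vector.
pdf_to_lines <- function(pdf_path, txt_path = txt_path_for(pdf_path)) {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  # shQuote() protects file names that contain spaces.
  status <- system2("pdftotext", args = shQuote(c(pdf_path, txt_path)))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }
  readLines(txt_path, warn = FALSE)
}

# Usage, with "foo.pdf" as a placeholder file name:
# lines <- pdf_to_lines("foo.pdf")
# head(lines)
```

The text file is left on disk next to the PDF, which matches what pdftotext does by default.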
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
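A short sketch of how that might look, assuming pdftools is installed. pdf_text() returns one string per page; the helpers split_pages and pdf_lines are hypothetical names used here only to turn those pages into individual lines.

```r
# Sketch: read a PDF with pdftools and split the page text into lines.
# Assumes the pdftools package is installed.

# Hypothetical helper: split page strings on newlines into one vector of lines.
split_pages <- function(pages) {
  unlist(strsplit(pages, "\n", fixed = TRUE))
}

# Hypothetical helper: all lines of a PDF as a character vector.
pdf_lines <- function(pdf_path) {
  if (!requireNamespace("pdftools", quietly = TRUE)) {
    stop("the pdftools package is required")
  }
  pages <- pdftools::pdf_text(pdf_path)  # one character string per page
  split_pages(pages)
}

# Usage, with "foo.pdf" as a placeholder file name:
# lines <- pdf_lines("foo.pdf")
```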
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
then you'll have pdf lines in an array.
The Tabula PDF table extractor app is built around tabula-extractor, a command-line application packaged as a Java JAR.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get back the data extracted from its tables.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
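As a rough sketch of the per-page area idea, assuming the tabulizer package (and the Java runtime it needs) is installed: extract_tables() takes pages and a list of areas, each given as top, left, bottom, right. The helpers check_areas and extract_region_tables, and the coordinates in the usage lines, are invented for illustration.

```r
# Sketch: extract tables from chosen page regions with tabulizer.
# Assumes the tabulizer package and a working Java installation.

# Hypothetical helper: one area per page, each c(top, left, bottom, right).
check_areas <- function(pages, areas) {
  length(pages) == length(areas) &&
    all(vapply(areas, function(a) is.numeric(a) && length(a) == 4, logical(1)))
}

# Hypothetical helper: extract tables from the given region of each page.
extract_region_tables <- function(pdf_path, pages, areas) {
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("the tabulizer package is required")
  }
  if (!check_areas(pages, areas)) {
    stop("supply one c(top, left, bottom, right) area per page")
  }
  # guess = FALSE so the supplied areas are used instead of auto-detection.
  tabulizer::extract_tables(pdf_path, pages = pages, area = areas,
                            guess = FALSE)
}

# Usage with placeholder coordinates (not taken from any real file):
# tabs <- extract_region_tables("foo.pdf", pages = c(1, 2),
#                               areas = list(c(100, 50, 400, 550),
#                                            c(80, 50, 300, 550)))
```

Leaving out the areas and calling extract_tables() on the file alone lets Tabula guess the table locations instead.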
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set path to pdftotext.exe and convert pdf to text
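A sketch of what such a call could look like when only a table on the first page matters. It uses pdftotext's -layout, -f and -l options; the executable path in the usage lines is a placeholder, and pdftotext_args and convert_leading_table are hypothetical helper names.

```r
# Sketch: call a specific pdftotext executable, keeping the page layout so
# that table columns stay aligned, and converting only a page range.

# Hypothetical helper: build the pdftotext argument vector.
pdftotext_args <- function(pdf_path, txt_path, first = 1, last = 1) {
  c("-layout",     # preserve the physical layout of the page
    "-f", first,   # first page to convert
    "-l", last,    # last page to convert
    shQuote(pdf_path), shQuote(txt_path))
}

# Hypothetical helper: convert the page range and return its lines.
convert_leading_table <- function(exe, pdf_path, txt_path) {
  if (!file.exists(exe)) {
    stop("pdftotext executable not found: ", exe)
  }
  status <- system2(exe, args = pdftotext_args(pdf_path, txt_path))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }
  readLines(txt_path, warn = FALSE)
}

# Usage; the path below is a placeholder, adjust it to your installation:
# exe <- "C:/path/to/pdftotext.exe"
# table_lines <- convert_leading_table(exe, "foo.pdf", "foo.txt")
```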
You can also use Ghostscript (see https://www.ghostscript.com/) and call it from R as follows:
There is a free version of Ghostscript.
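One possible way to make that call, as a sketch only: it assumes Ghostscript is installed and uses its txtwrite output device. tools::find_gs_cmd() ships with R and locates the executable; gs_text_args and pdf_to_text_gs are hypothetical helper names.

```r
# Sketch: extract text from a PDF with Ghostscript's txtwrite device.
# Assumes Ghostscript is installed where tools::find_gs_cmd() can find it.

# Hypothetical helper: build the Ghostscript argument vector.
gs_text_args <- function(pdf_path, txt_path) {
  c("-q",                  # suppress startup messages
    "-dNOPAUSE",           # do not pause between pages
    "-dBATCH",             # exit when processing is finished
    "-sDEVICE=txtwrite",   # write extracted text instead of rendering
    shQuote(paste0("-sOutputFile=", txt_path)),
    shQuote(pdf_path))
}

# Hypothetical helper: run Ghostscript and return the text as lines.
pdf_to_text_gs <- function(pdf_path, txt_path) {
  gs <- tools::find_gs_cmd()
  if (!nzchar(gs)) {
    stop("Ghostscript was not found")
  }
  status <- system2(gs, args = gs_text_args(pdf_path, txt_path))
  if (status != 0) {
    stop("Ghostscript exited with status ", status)
  }
  readLines(txt_path, warn = FALSE)
}

# Usage, with placeholder file names:
# lines <- pdf_to_text_gs("foo.pdf", "foo.txt")
```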