Reading data from PDF files into R
Is that even possible!?!
I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF? Or should I leave that to a command-line tool?
The reports were made in Excel and then exported to PDF, so they have a regular structure, but many blank "cells".
Comments (6)
So... this gets me close even on a fairly complex table.
Download a sample PDF from bmi pdf
Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, but rather bitmapped images of text (or possibly even uglier things than I can imagine), nothing other than OCR can help you.
On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.
Probably better to make a couple of grad students transcribe the data for you. They're cheap :-)
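A quick way to tell whether a given PDF has any extractable text at all is to run a text extractor over it and see if anything comes back. The following is only a rough sketch: `has_text_layer` is a hypothetical helper name, `report.pdf` is a placeholder path, and it assumes the pdftools package is installed.

```r
# Hypothetical helper: TRUE if any page yielded non-whitespace text.
# `pages` is a character vector with one element per page, which is
# the shape that pdftools::pdf_text() returns.
has_text_layer <- function(pages) {
  any(nzchar(trimws(pages)))
}

pdf_path <- "report.pdf"  # placeholder path, not from the original post
if (file.exists(pdf_path) && requireNamespace("pdftools", quietly = TRUE)) {
  pages <- pdftools::pdf_text(pdf_path)
  if (has_text_layer(pages)) {
    message("Text found; extraction is worth trying.")
  } else {
    message("No text found; the pages are probably images and need OCR.")
  }
}
```

A TRUE result only says that some text exists; it says nothing about whether the text comes out in table order.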
The current package du jour for getting text out of PDFs is pdftools (successor to Rpoppler, noted in another answer), which works great on Linux, Windows and OSX:
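The code from the original answer is not preserved here, so what follows is only a minimal sketch of how pdftools might be used for a report like the one in the question. `pdf_text()` is the package's text extraction function and returns one string per page; `page_to_fields` and the path `report.pdf` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: split one page of extracted text into fields.
# Columns are assumed to be separated by runs of two or more spaces.
page_to_fields <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]   # drop empty lines
  strsplit(lines, " {2,}")        # one character vector per row
}

pdf_path <- "report.pdf"  # placeholder path, not from the original post
if (file.exists(pdf_path) && requireNamespace("pdftools", quietly = TRUE)) {
  pages <- pdftools::pdf_text(pdf_path)  # one string per page
  rows <- page_to_fields(pages[1])
  print(head(rows))
}
```

Splitting on whitespace runs collapses blank "cells", so rows with missing values will come out shorter than the header row. For tables like the ones described in the question, slicing each line at fixed character positions is likely to be more reliable.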
You can also (now) use the new (2015-07) Rpoppler package. It includes 3 functions (4, really, but one just gets you a pointer to the PDF object):
PDF_fonts: PDF font information
PDF_info: PDF document information
PDF_text: PDF text extraction
(Posting as an answer to help new searchers find the package.)
Per zx8754, the following works on Windows 7 with pdftotext.exe in the working directory:
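The original snippet is missing from this copy of the page. As a rough sketch of the general approach, and not the answerer's actual code, R can shell out to pdftotext and read the result back. `pdf_to_lines` and `report.pdf` are hypothetical names; `-layout` is pdftotext's option for preserving the physical layout of the page.

```r
# Hypothetical wrapper: convert a PDF to text with pdftotext and
# return the text as a character vector of lines.
pdf_to_lines <- function(pdf_path, exe = "pdftotext") {
  txt_path <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_path))  # remove the temporary text file afterwards
  status <- system2(exe, args = c("-layout", shQuote(pdf_path), shQuote(txt_path)))
  if (status != 0) stop("pdftotext failed with status ", status)
  readLines(txt_path, warn = FALSE)
}

pdf_path <- "report.pdf"  # placeholder path, not from the original post
if (file.exists(pdf_path) && nzchar(Sys.which("pdftotext"))) {
  lines <- pdf_to_lines(pdf_path)
  print(head(lines))
}
```

With the executable sitting in the working directory rather than on the PATH, passing something like `exe = "pdftotext.exe"` should work on Windows, though that is untested here.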
Here is another that can be used with Acrobat Pro: