Reading data from PDF files into R

Posted 2025-01-03 01:10:42

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool?

The reports were made in Excel and then exported to PDF, so they have a regular structure but many blank "cells".

Comments (6)

能否归途做我良人 2025-01-10 01:10:42

So... this gets me close even on a fairly complex table.

Download a sample pdf from bmi pdf

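library(tm)

# Build a reader that runs pdftotext with -layout to keep the column alignment.
# Newer tm releases spell this readPDF(control = list(text = "-layout")).
pdf <- readPDF(PdftotextOptions = "-layout")

dat <- pdf(elem = list(uri='bmi_tbl.pdf'), language='en', id='id1')

# Turn each run of spaces into a comma, then parse the result as CSV
dat <- gsub(' +', ',', dat)
out <- read.csv(textConnection(dat), header=FALSE)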
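
Because the reports in the question have many blank cells, turning runs of spaces into commas can shift later values to the left whenever a cell is empty. One alternative is to treat the -layout output as fixed-width text. The sketch below is only an illustration: the sample rows, the column names, and the widths are made up, and real widths would have to be read off the actual pdftotext output.

```r
# Hypothetical rows, padded the way `pdftotext -layout` output might look.
# The age cell of the second row is deliberately blank.
rows <- data.frame(
  name   = c("Alice", "Bob", "Carol"),
  age    = c("23", "", "31"),
  height = c("170", "182", "165"),
  stringsAsFactors = FALSE
)
layout_lines <- sprintf("%-11s%-7s%3s", rows$name, rows$age, rows$height)

# Write the lines to a temporary file so read.fwf() can read them back.
tmp <- tempfile(fileext = ".txt")
writeLines(layout_lines, tmp)

# Assumed column widths (11, 7, 3); they match the sprintf() padding above.
tbl <- read.fwf(tmp, widths = c(11, 7, 3),
                col.names = c("name", "age", "height"),
                strip.white = TRUE, stringsAsFactors = FALSE)
unlink(tmp)

print(tbl)
```

With fixed widths the blank cell stays in its own column and is read as NA instead of pulling the next value into its place.
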
巷雨优美回忆 2025-01-10 01:10:42

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.

On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)
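
A quick way to spot the "no actual text" case before investing in parsing is to look for pages whose extracted text is empty. This is a rough sketch, assuming the extractor returns one string per page and that an image-only page comes back empty or whitespace-only; `pages_without_text` and the sample vector are hypothetical.

```r
# `pages` stands in for the character vector that a text extractor such as
# pdftools::pdf_text() returns: one string per page.
# Assumption: a page with no text layer comes back empty or whitespace-only.
pages_without_text <- function(pages) {
  which(!nzchar(trimws(pages)))
}

# Made-up example: page 2 behaves like a scanned image.
pages <- c("Report 2009\nTotal   42\n", "   \n", "Appendix\n")
pages_without_text(pages)
```

Any page numbers this returns are candidates for OCR or manual transcription.
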

半寸时光 2025-01-10 01:10:42

The current package du jour for getting text out of PDFs is pdftools (the successor to Rpoppler, mentioned in another answer); it works well on Linux, Windows and OSX:

install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
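
Each element of `txt` is one page as a single string, so a table still has to be split into lines and fields. Below is a minimal sketch on a made-up page string, assuming columns are separated by at least two spaces and that no cell is blank (a blank cell would leave that row one field short).

```r
# Made-up page text standing in for one element of pdf_text()'s result.
page <- "Name    Weight  Height\nAnn     61      170\nBen     80      182\n"

# Split the page into non-empty lines.
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]

# Split each line on runs of two or more spaces (an assumption about layout).
fields <- strsplit(trimws(lines), " {2,}")

# First line is the header; the rest are data rows.
df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(df) <- fields[[1]]

# Convert columns that look numeric from character to numbers.
df[] <- lapply(df, type.convert, as.is = TRUE)
print(df)
```
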
雨的味道风的声音 2025-01-10 01:10:42

You can also (now) use the new (2015-07) Rpoppler package:

Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one just gets you a ptr to the PDF object):

  • PDF_fonts PDF font information
  • PDF_info PDF document information
  • PDF_text PDF text extraction

(posting as an answer to help new searchers find the package).

盛夏已如深秋| 2025-01-10 01:10:42

Per zx8754, the following works in Win7 with pdftotext.exe in the working directory:

library(tm)
uri = 'bmi_tbl.pdf'
pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                language = "en", id = "id1")   
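
Since this route depends on an external pdftotext executable, it can help to check for it first, or to call the command line tool directly as the question suggests. This is a sketch only: `pdftotext_available` and `pdf_layout_lines` are hypothetical helper names, and the lookup assumes pdftotext is reachable through the PATH.

```r
# TRUE if a pdftotext executable can be found on the PATH.
pdftotext_available <- function() {
  unname(nzchar(Sys.which("pdftotext")))
}

# Hypothetical helper: run `pdftotext -layout <file> -` and return the lines.
# The trailing "-" asks pdftotext to write to stdout instead of a .txt file.
pdf_layout_lines <- function(path) {
  if (!pdftotext_available()) {
    stop("pdftotext was not found on the PATH")
  }
  system2("pdftotext", c("-layout", shQuote(path), "-"), stdout = TRUE)
}

pdftotext_available()
```

If the check passes, `pdf_layout_lines("bmi_tbl.pdf")` would return the same layout-preserving text that tm's reader works from, one element per line.
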
月亮坠入山谷 2025-01-10 01:10:42

Here is another approach that can be used with Acrobat Pro:

library(RDCOMClient)
acrobat_App <- COMCreate("AcroExch.App")
acrobat_PDDoc <- COMCreate("AcroExch.PDDoc")
acrobat_AVDoc <- COMCreate("AcroExch.AVDoc")
acrobat_PageContent <- COMCreate("AcroExch.HiliteList")
objADOStream <- COMCreate("ADODB.Stream")
acrobat_AVDoc$open("C:\\my_PDF.pdf", 1)
av_Doc <- acrobat_App$GetActiveDoc()
pdf_doc <- av_Doc$GetPDDoc()
pdf_doc$GetNumPages()
page_Number <- pdf_doc$AcquirePage(1)
acrobat_PageContent$Add(0, 9000)
sel_Text <- page_Number$CreatePageHilite(acrobat_PageContent)
index <- 0 : (sel_Text$GetNumText() - 1)
vec_Char <- rep("", length(index))

for(i in index)
{
  print(i)
  # The COM text indices are 0-based, but R vectors are 1-based
  vec_Char[i + 1] <- sel_Text$GetText(i)
}