Reading data from PDF files into R

Posted 2025-01-03 01:10:42

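Is this even possible!?!

I have a bunch of old reports that I need to import into a database. However, they are all in PDF format. Are there any R packages that can read PDFs? Or should I leave that to a command-line tool?

The reports were made in Excel and then exported to PDF, so they have a regular structure, but many blank "cells".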

能否归途做我良人 2025-01-10 01:10:42

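So... this gets me close even on a fairly complex table.

Download a sample pdf from bmi pdf

library(tm)

# Build a reader that calls pdftotext (xpdf); "-layout" keeps the column alignment.
# Older tm versions spelled this readPDF(PdftotextOptions = "-layout").
read_pdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))

doc <- read_pdf(elem = list(uri = 'bmi_tbl.pdf'), language = 'en', id = 'id1')

# Collapse runs of spaces into commas, then parse the lines as CSV
dat <- gsub(' +', ',', content(doc))
out <- read.csv(text = dat, header = FALSE)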
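
One caveat with collapsing runs of spaces: a blank cell disappears, so the values after it slide left into the wrong column. A minimal sketch of an alternative, assuming the -layout output keeps each column at fixed character positions. The helper `split_fixed`, the sample lines, and the column positions are all made up for illustration:

```r
# Hypothetical helper: slice each line at fixed character positions so that
# a blank cell stays in its own column instead of vanishing.
split_fixed <- function(lines, starts, ends, col_names) {
  cols <- lapply(seq_along(starts), function(j) {
    cell <- trimws(substring(lines, starts[j], ends[j]))
    cell[cell == ""] <- NA_character_  # blank cell becomes a missing value
    cell
  })
  names(cols) <- col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up lines standing in for layout-preserving text output;
# the second row has an empty middle cell.
lines <- c("a      10   20",
           "b           30")

# Space-collapsing approach: the blank cell is lost
collapsed <- read.csv(text = gsub(" +", ",", lines), header = FALSE)

# Position-based approach: the blank cell is kept as NA
fixed <- split_fixed(lines,
                     starts = c(1, 8, 13),
                     ends = c(7, 12, 14),
                     col_names = c("id", "q1", "q2"))
```

Here `collapsed` puts the 30 from row b into the second column, while `fixed` keeps it under q2 and leaves q1 as NA. For a real report the start and end positions would have to be read off the extracted text.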

巷雨优美回忆 2025-01-10 01:10:42

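Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, but rather bitmapped images of text (or possibly even uglier things than I can imagine), nothing other than OCR can help you.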
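
A quick way to spot that situation before reaching for OCR is to check whether the extracted text is empty. A minimal sketch, assuming the pages have already been extracted to one string per page (for example with pdftools::pdf_text()). The helper name `pages_without_text` and the sample strings are hypothetical:

```r
# Hypothetical helper: given one string per page, return the indices of
# pages that contain no visible (non-whitespace) characters.
pages_without_text <- function(page_text) {
  which(!grepl("[^[:space:]]", page_text))
}

# Made-up page strings standing in for extracted text:
# page 1 has text, page 2 is empty, page 3 holds only whitespace.
pages <- c("Report 2009\nTotal  42\n", "", "  \n\n")

suspect <- pages_without_text(pages)
```

Pages listed in `suspect` are likely scanned images (or genuinely blank), so text extraction alone will not recover anything from them.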

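On top of that, in my sad experience there is no guarantee that the applications which create PDF documents all behave the same way, so the data in your table may or may not be read out in the desired order (depending on how the document was built). Be cautious.

It is probably better to have a couple of grad students transcribe the data for you. They're cheap :-)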

半寸时光 2025-01-10 01:10:42

The current package du jour for getting text out of PDFs is pdftools (the successor to Rpoppler, mentioned in another answer). It works well on Linux, Windows and OSX:

install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
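
Since pdf_text() returns one string per page, a simple table on a page can sometimes be recovered by splitting that string into lines and letting read.table() split on whitespace. A rough sketch under that assumption; the page string below is made up and stands in for txt[1]:

```r
# Made-up page text standing in for one element of the extracted pages
page <- "id   value\na    10\nb    20\n"

# Split the page into lines and drop any that are empty
rows <- strsplit(page, "\n", fixed = TRUE)[[1]]
rows <- rows[nzchar(trimws(rows))]

# Parse the remaining lines as a whitespace-separated table with a header
tab <- read.table(text = rows, header = TRUE, stringsAsFactors = FALSE)
```

This only works when every row has the same number of non-empty fields; tables with blank cells need a position-based approach instead.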

雨的味道风的声音 2025-01-10 01:10:42

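You can also (now) use the newer (2015-07) Rpoppler package:

Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one just gets you a pointer to the PDF object):

PDF_fonts  PDF font information
PDF_info   PDF document information
PDF_text   PDF text extraction

(Posting as an answer to help new searchers find the package.)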

盛夏已如深秋| 2025-01-10 01:10:42

Per zx8754 ... the following works on Win7 with pdftotext.exe in the working directory:

library(tm)
uri = 'bmi_tbl.pdf'
pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                language = "en", id = "id1")

月亮坠入山谷 2025-01-10 01:10:42

Here is another approach that can be used with Acrobat Pro:

library(RDCOMClient)
# Create the Acrobat COM objects (COM is Windows only)
acrobat_App <- COMCreate("AcroExch.App")
acrobat_PDDoc <- COMCreate("AcroExch.PDDoc")
acrobat_AVDoc <- COMCreate("AcroExch.AVDoc")
acrobat_PageContent <- COMCreate("AcroExch.HiliteList")
objADOStream <- COMCreate("ADODB.Stream")
acrobat_AVDoc$open("C:\\my_PDF.pdf", 1)
av_Doc <- acrobat_App$GetActiveDoc()
pdf_doc <- av_Doc$GetPDDoc()
pdf_doc$GetNumPages()
# Acrobat page indices start at 0, so this acquires the second page
page_Number <- pdf_doc$AcquirePage(1)
# Add a highlight range starting at offset 0 with length 9000
acrobat_PageContent$Add(0, 9000)
sel_Text <- page_Number$CreatePageHilite(acrobat_PageContent)
n_text <- sel_Text$GetNumText()
vec_Char <- character(n_text)

# GetText() takes a 0-based index, while R vectors are 1-based
for (i in seq_len(n_text))
{
  vec_Char[i] <- sel_Text$GetText(i - 1)
}
