
Extracting text data from PDF files

Posted on 2024-09-26 14:06:38

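Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for this kind of extraction, but has anyone attempted it or seen it done in R?

Python has PDFMiner, but I would like to keep this analysis entirely in R if possible.

Any suggestions?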


寒冷纷飞旳雪 2024-10-03 14:06:38

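Linux systems have pdftotext, which I have had some success with. By default, it creates foo.txt from a given foo.pdf.

That said, the text mining packages may have converters. A quick rseek.org search seems to agree with what your own search found.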
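
The pdftotext route can also be driven from R. The following is a minimal sketch, assuming the pdftotext executable is installed and on the PATH; the helper names `txt_path_for` and `pdf_to_txt` and the file name `foo.pdf` are made up for illustration.

```r
# Hypothetical helper: derive the output name, e.g. foo.pdf -> foo.txt
txt_path_for <- function(pdf) {
  paste0(tools::file_path_sans_ext(pdf), ".txt")
}

# Hypothetical helper: convert one PDF with the external pdftotext tool
# and read the resulting text file back into R, one element per line.
# Assumes pdftotext is installed and can be found on the PATH.
pdf_to_txt <- function(pdf, txt = txt_path_for(pdf)) {
  exe <- Sys.which("pdftotext")
  if (!nzchar(exe)) stop("pdftotext was not found on the PATH")
  # -layout asks pdftotext to keep the physical layout of the page
  status <- system2(exe, args = c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt, warn = FALSE)
}

# Usage (foo.pdf is a placeholder):
# lines <- pdf_to_txt("foo.pdf")
```

Passing the output path explicitly, rather than relying on the tool's default naming, keeps it clear which file R reads back.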


梦纸 2024-10-03 14:06:38

This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.


空‖城人不在 2024-10-03 14:06:38

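A colleague pointed me to this handy open-source tool: http://tabula.nerdpower.org/. Install it, upload the PDF, and select the table in the PDF that you want turned into data. It is not a direct solution in R, but it certainly beats manual labor.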


提赋 2024-10-03 14:06:38

A purely R solution could be:

library('tm')
file <- 'namefile.pdf'
# Build a PDF reader; "-layout" is passed on as a text-extraction option
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
      readerControl = list(reader = Rpdf))
# Content of the first (and only) document, as a character vector
corpus.array <- content(content(corpus)[[1]])

You will then have the lines of the PDF in an array.


落在眉间の轻吻 2024-10-03 14:06:38

install.packages("pdftools")
library(pdftools)


download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", 
              "56901.DEN.Gamebook", mode = "wb")

txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
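
pdf_text() returns one character string per page, so a common next step is to break each page into lines. Here is a minimal sketch of that step in base R; the `txt` vector below is made-up stand-in text (not the contents of the gamebook), used so the sketch runs without a PDF, and it assumes that lines are separated by newline characters.

```r
# Stand-in for the result of pdf_text(): one string per page.
# (Invented text, only for illustration.)
txt <- c("Game Summary\nDenver   24\nVisitor  10\n",
         "Page two\n\nMore text\n")

# Split every page into its lines (tolerating Windows line endings)
pages <- strsplit(txt, "\r?\n")

# Trim surrounding whitespace and drop empty lines
pages <- lapply(pages, function(x) {
  x <- trimws(x)
  x[nzchar(x)]
})

length(pages)   # number of pages
pages[[1]][2]   # second line of the first page
```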


送君千里 2024-10-03 14:06:38

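The tabula PDF table extractor app is built around a command-line application packaged as a Java JAR, tabula-extractor.

The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get the data extracted from its tables.

Tabula makes a good guess at where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.

Data can be extracted from multiple pages, and a different area can be specified for each page if required.

For an example use case, see: When Documents Become Databases - Tabulizer R Wrapper for Tabula PDF Table Extractor.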
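
The options described above map onto the `pages`, `area` and `guess` arguments of tabulizer's `extract_tables()`. The following is a rough sketch rather than a definitive recipe: it assumes the tabulizer package and a Java runtime are installed, and the wrapper name `extract_page_tables`, the file name and the coordinates in the usage lines are invented for illustration.

```r
# Hypothetical wrapper around tabulizer::extract_tables().
# areas, if given, is a list with one numeric vector per page:
# c(top, left, bottom, right).
extract_page_tables <- function(pdf, pages, areas = NULL) {
  if (!is.null(areas) && length(areas) != length(pages)) {
    stop("supply one area per page")
  }
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("the tabulizer package (and a Java runtime) must be installed")
  }
  if (is.null(areas)) {
    # Let Tabula guess where the tables are
    tabulizer::extract_tables(pdf, pages = pages)
  } else {
    # Use the given areas instead of guessing
    tabulizer::extract_tables(pdf, pages = pages, area = areas, guess = FALSE)
  }
}

# Usage (file name and coordinates are placeholders):
# tabs <- extract_page_tables("report.pdf", pages = 1:2)
# tabs <- extract_page_tables("report.pdf", pages = c(1, 3),
#                             areas = list(c(100, 50, 400, 550),
#                                          c(80, 50, 300, 550)))
```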


彩虹直至黑白 2024-10-03 14:06:38

I used an external utility to do the conversion and called it from R. All of the files had a leading table containing the desired information.

Set the path to pdftotext.exe and convert each PDF to text:

library(stringr)  # provides str_sub()

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

# pdfFracList (PDF file names) and reportDir (their folder) must already be defined
for (i in seq_along(pdfFracList)) {
    fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)  # drop ".pdf"
    pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
    txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
    print(paste0("File number ", i, ", Processing file ", pdfSource))
    # shQuote() protects paths that contain spaces
    system(paste(shQuote(exeFile), "-table", shQuote(pdfSource), shQuote(txtDestination)), wait = TRUE)
}


风吹过旳痕迹 2024-10-03 14:06:38

You can also use Ghostscript (see https://www.ghostscript.com/) and call it from R as follows:

system2("C:\\Program Files\\gs\\gs10.02.0\\bin\\gswin64c.exe", args="-sDEVICE=txtwrite -sOutputFile=- -q -dDOPDFMARKS -dNOPAUSE -dBATCH C:\\Annotated.pdf")

A free version of Ghostscript is available.
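
Because `-sOutputFile=-` sends the text to standard output, the result can be captured in an R variable with `stdout = TRUE` instead of only being printed. A minimal sketch, assuming Ghostscript is installed; the helper names `gs_text_args` and `gs_extract_text` are hypothetical, and the flags are simply the ones used in the call above.

```r
# Build the Ghostscript arguments for text extraction to standard output.
# The flags are copied from the call above; the PDF path is quoted so that
# paths containing spaces survive.
gs_text_args <- function(pdf) {
  c("-sDEVICE=txtwrite", "-sOutputFile=-", "-q",
    "-dDOPDFMARKS", "-dNOPAUSE", "-dBATCH", shQuote(pdf))
}

# Hypothetical helper: run Ghostscript and return its output as a
# character vector, one element per output line.
# gs_exe is the path to the Ghostscript console executable on your machine.
gs_extract_text <- function(pdf, gs_exe) {
  system2(gs_exe, args = gs_text_args(pdf), stdout = TRUE)
}

# Usage (paths taken from the call above):
# txt <- gs_extract_text("C:\\Annotated.pdf",
#                        "C:\\Program Files\\gs\\gs10.02.0\\bin\\gswin64c.exe")
```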
