pyPdf unable to extract text from some pages in my PDF
I'm trying to use pyPdf to extract and print pages from a multipage PDF. The problem is that text is not extracted from some pages.
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()
Note that extractText() still has problems extracting text properly, and the documentation for extractText() says as much. Since it is the text you want, you can use the Linux command pdftotext, which you can also invoke from Python. The text is extracted from forms.pdf and saved to output. This works for your PDF file and extracts the text you want.
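The code that went with this answer did not survive on this page, so here is a minimal sketch of one way to make the call, assuming pdftotext is installed and on the PATH. The helper names pdftotext_command and extract_to_file are made up for illustration.

```python
import subprocess

def pdftotext_command(pdf_path, out_path):
    # Builds the argument list: pdftotext <input.pdf> <output text file>
    return ["pdftotext", pdf_path, out_path]

def extract_to_file(pdf_path, out_path):
    # check_call raises CalledProcessError if pdftotext exits with an error,
    # and OSError if the executable cannot be found.
    subprocess.check_call(pdftotext_command(pdf_path, out_path))
```

Calling extract_to_file("forms.pdf", "output") should write the extracted text to a file named output in the current directory.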
You could also try the pdfminer library (also written in Python) and see if it's better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.
This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and sometimes even when there are no non-ASCII characters. When pyPdf encounters strings that are not in a standard Unicode encoding, it just sees a bunch of bytes; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.
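To make the CMap idea concrete, here is a toy illustration in Python 3. It is not pyPdf code and not a real CMap parser: the SAMPLE_CMAP text is made up, only bfchar-style pairs are handled, and character codes are assumed to be a single byte each.

```python
import re

# A tiny, made-up fragment in the style of a PDF ToUnicode CMap.
# Each pair maps a character code to a UTF-16BE encoded Unicode string.
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for each <src> <dst> pair."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode(codes, mapping):
    """Translate raw string bytes to text; unmapped codes yield nothing."""
    return "".join(mapping.get(code, "") for code in codes)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode(b"\x01\x02", mapping))  # prints "Hi"
print(repr(decode(b"\x01\x02", {})))  # prints '' (no CMap, no text)
```

The second call mirrors the symptom in the question: without the mapping, the bytes in the page cannot be turned into characters, so the result is an empty string.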
I sometimes find it useful to convert the file to ps (try both pdf2ps and pdftops, since their output can differ), then back to pdf (ps2pdf). Then try your original script again.
I had a similar problem with some PDFs on Windows, and this works very well for me:
1.- Download the Xpdf tools for Windows.
2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64.
3.- Use subprocess to run the pdftotext command as you would from the console.
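The command for step 3 is missing from this page. A sketch of what it could look like follows, assuming pdftotext.exe can be found on the PATH after step 2. The function name pdf_to_text and its run parameter are hypothetical; run exists only so the call can be swapped out when testing.

```python
import subprocess

def pdf_to_text(pdf_path, exe="pdftotext", run=subprocess.check_output):
    # "-" as the output name asks pdftotext to write the text to stdout;
    # "-enc UTF-8" requests UTF-8 output so the bytes can be decoded below.
    cmd = [exe, "-enc", "UTF-8", pdf_path, "-"]
    try:
        raw = run(cmd, stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError) as err:
        # OSError: executable not found; CalledProcessError: non-zero exit.
        print("pdftotext failed: %s" % err)
        return None
    return raw.decode("utf-8", "replace")
```

Passing the arguments as a list avoids shell=True, so file paths containing spaces do not need quoting.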
I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp. 1-82, which have text page labels (pdftotext can extract these), and pp. 83-end, which have no page labels but which pyPdf can extract, and it explicitly knows the pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly, I have to do this on a Windows machine.
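One possible shape for that combination, as a rough sketch rather than what was actually done here: take the per-page text from both tools and, page by page, keep pyPdf's text when it is non-empty and fall back to pdftotext otherwise. It assumes pdftotext's default behaviour of ending each page with a form feed character, and both function names are made up.

```python
def split_pdftotext_pages(raw_text):
    """Split pdftotext output into pages on the form feed separator."""
    pages = raw_text.split("\f")
    # Output normally ends with a form feed, leaving an empty last piece.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages

def merge_page_texts(pypdf_pages, pdftotext_pages):
    """Prefer pyPdf's text per page; use pdftotext where pyPdf found none."""
    merged = []
    for index, text in enumerate(pypdf_pages):
        if text.strip():
            merged.append(text)
        elif index < len(pdftotext_pages):
            merged.append(pdftotext_pages[index])
        else:
            merged.append("")
    return merged
```

Here pypdf_pages would be the list of page.extractText() results and pdftotext_pages the result of split_pdftotext_pages on the pdftotext output for the same file.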