pyPdf unable to extract text from some pages in my PDF

Posted on 2024-10-02 17:37:19


I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages.

If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?

from pyPdf import PdfFileReader
reader = PdfFileReader(file("forms.pdf", "rb"))
for page in reader.pages:
    print page.extractText()


可是我不能没有你 2024-10-09 17:37:19


Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

This works well for some PDF files,
but poorly for others, depending on
the generator used. This will be
refined in the future. Do not rely on
the order of text coming out of this
function, as it will change if this
function is made more sophisticated.

Since it is the text you want, you can use the Linux command pdftotext.

To invoke that using Python, you can do this:

>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])

The text is extracted from forms.pdf and saved to output.

This works in the case of your PDF file and extracts the text you want.
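
Since only some pages fail, it may help to run pdftotext on just that page range. The following is a minimal Python 3 sketch, not a tested recipe for this particular file: the helper names build_pdftotext_cmd and extract_text are made up for illustration, and it assumes pdftotext is on the PATH. The -f, -l and -enc options are standard pdftotext flags, and "-" as the output file sends the text to stdout.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, output="-"):
    """Build a pdftotext command line; output "-" means write to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, output]
    return cmd

def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext (assumed to be installed) and return the text as a str."""
    cmd = build_pdftotext_cmd(pdf_path, first_page, last_page)
    return subprocess.check_output(cmd).decode("utf-8", errors="replace")

# Show the command that would be run for pages 1 to 81 of forms.pdf.
print(build_pdftotext_cmd("forms.pdf", 1, 81))
# ['pdftotext', '-enc', 'UTF-8', '-f', '1', '-l', '81', 'forms.pdf', '-']
```

Calling extract_text("forms.pdf", 1, 81) would then return the text of the pages that pyPdf leaves empty, provided pdftotext can decode them.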


魄砕の薆 2024-10-09 17:37:19

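You could also try the pdfminer library (also written in Python) and see whether it extracts the text any better. For splitting, however, you will have to stick with pyPdf, since pdfminer doesn't support that.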


依靠 2024-10-09 17:37:19

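This isn't really an answer, but the problem with pyPdf is that it doesn't support CMaps yet. PDF allows fonts to use a CMap to map character IDs (the bytes in the PDF) to Unicode character codes. When your PDF contains non-ASCII characters, a CMap is probably in use, and sometimes one is used even when there are no non-ASCII characters. When pyPdf encounters a string that is not in a standard Unicode encoding, it just sees a bunch of byte codes; it can't convert those bytes to Unicode, so it gives you empty strings. I actually ran into the same problem and am working through the source code at the moment. It is time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.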
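
To make the CMap idea concrete, here is a small Python 3 sketch of how a ToUnicode-style mapping turns byte codes into text. It is an illustration only, not pyPdf's code: the SAMPLE_CMAP fragment is hand-written, the helper names are hypothetical, and it handles only bfchar entries with fixed two-byte codes, whereas real CMaps also use bfrange sections and other code widths.

```python
import re

# Hand-written fragment in the style of a ToUnicode CMap (illustrative data only).
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: unicode string} from the bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is a UTF-16BE string written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_with_cmap(raw, mapping, code_size=2):
    """Decode raw string bytes by looking up each fixed-width code."""
    chars = []
    for i in range(0, len(raw), code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        chars.append(mapping.get(code, "\ufffd"))  # U+FFFD for unmapped codes
    return "".join(chars)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_with_cmap(b"\x00\x01\x00\x02", mapping))  # Hi
```

Without the mapping, the bytes 00 01 00 02 carry no readable meaning, which is why an extractor that ignores CMaps ends up with nothing useful for such pages.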


自此以后，行同陌路 2024-10-09 17:37:19

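I have found it sometimes useful to convert the file to PostScript (try both pdf2ps and pdftops, since their output can differ) and then back to PDF (ps2pdf). Then try your original script again.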
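
The round trip could be scripted roughly as follows. This is a Python 3 sketch under assumptions: the Ghostscript tools pdf2ps and ps2pdf are installed and on the PATH, and the function names and the "-roundtrip" output suffix are invented for the example.

```python
import os
import subprocess

def roundtrip_commands(pdf_path):
    """Return the two conversion commands and the path of the regenerated PDF."""
    base, _ = os.path.splitext(pdf_path)
    ps_path = base + ".ps"
    new_pdf_path = base + "-roundtrip.pdf"
    commands = [
        ["pdf2ps", pdf_path, ps_path],      # PDF -> PostScript
        ["ps2pdf", ps_path, new_pdf_path],  # PostScript -> PDF
    ]
    return commands, new_pdf_path

def roundtrip(pdf_path):
    """Run both conversions; raises CalledProcessError if either tool fails."""
    commands, new_pdf_path = roundtrip_commands(pdf_path)
    for cmd in commands:
        subprocess.check_call(cmd)
    return new_pdf_path
```

Swapping "pdf2ps" for "pdftops" in the first command tries the other converter; the regenerated file can then be passed to the original extraction script.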


过度放纵 2024-10-09 17:37:19


I had a similar problem with some PDFs on Windows, and this works very well for me:

1.- Download the Xpdf tools for Windows.

2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64.

3.- Use subprocess to run the command:

import subprocess

# filePath is the path to the PDF file to convert.
try:
    # "-" as the output file makes pdftotext write the text to stdout.
    # Passing an argument list (no shell) also handles paths with spaces.
    extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)
