How to read a PDF file line by line using PyPdf?
I have some code that reads from a PDF file. Is there a way to read the file line by line (rather than page by page) using pyPdf with Python 2.6 on Windows?
Here is the code for reading the PDF pages:
import pyPdf

def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
Update:
The calling code is:
f = open('test.txt', 'w')
pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
f.write(pdfl)
f.close()
Comments (3)
Looks like what you have is a large chunk of text data that you want to interpret line-by-line.
You can use the StringIO class to wrap that content as a seekable file-like object:
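A minimal sketch of that wrapping, assuming the text is already in memory. It uses io.StringIO, which is where the class lives in Python 3 (on Python 2.6 the same idea works with the StringIO module), and the sample text is made up for illustration:

```python
import io

# Made-up sample text standing in for text extracted from a PDF.
text = u"first line\nsecond line\nthird line\n"

# Wrap the string so it can be read like a file.
buf = io.StringIO(text)

# readline() returns everything up to and including the next newline.
first = buf.readline()

# Iterating over the buffer yields the remaining lines one at a time.
rest = [line.rstrip(u"\n") for line in buf]

# The buffer is seekable, so it can be rewound and read again.
buf.seek(0)
again = buf.readline()
```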
In your case, do:
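One way this could look for the question's code is sketched below. Note that getPDFContent joins the result of split(), which collapses every run of whitespace, newlines included, into a single space, so its return value is one long line. The sketch therefore cleans each line separately before wrapping the text. The name clean_lines is a hypothetical helper, and raw is a made-up stand-in for the concatenated extractText() output:

```python
import io

def clean_lines(raw):
    """Normalise whitespace within each line but keep the line breaks."""
    cleaned = []
    for line in raw.replace(u"\xa0", u" ").splitlines():
        line = u" ".join(line.split())
        if line:  # drop lines that are empty after cleaning
            cleaned.append(line)
    return u"\n".join(cleaned)

# Made-up stand-in for the text that extractText() might return.
raw = u"Title\xa0page\n\n  first   paragraph \nsecond paragraph\n"

# Wrap the cleaned text and read it back one line at a time.
pdf_content = io.StringIO(clean_lines(raw))
lines = [line.strip() for line in pdf_content]
```

Inside the loop, each stripped line can be handed to whatever per-line processing is needed instead of being collected into a list.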
Using yield and PdfFileReader.pages can simplify things. In addition, some people may google "python get pdf content text", so here is how (this is how I got here):
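A minimal sketch of that generator approach, not necessarily the answerer's exact code. The function name iter_pdf_lines is an assumption, and FakePage and FakeReader are hypothetical stand-ins that mimic the pages and extractText() interface of pyPdf so the example runs without a PDF on disk:

```python
def iter_pdf_lines(reader):
    """Yield text lines one at a time from every page of `reader`.

    `reader` is any object exposing `.pages` whose items have an
    `extractText()` method, as pyPdf's PdfFileReader does.
    """
    for page in reader.pages:
        for line in page.extractText().splitlines():
            yield line

# Hypothetical stand-ins for pyPdf objects, for illustration only.
class FakePage(object):
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text

class FakeReader(object):
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]

reader = FakeReader([u"page one, line one\npage one, line two", u"page two"])
lines = list(iter_pdf_lines(reader))

# With pyPdf installed, the real call would look roughly like this:
#     from pyPdf import PdfFileReader
#     with open("test.pdf", "rb") as f:
#         for line in iter_pdf_lines(PdfFileReader(f)):
#             print(line)
```

Because the function is a generator, lines are produced lazily, page by page, rather than building one large string first. How many line breaks actually appear depends on what extractText() returns for a given PDF, and some files may come back with few or none.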