How do I read a PDF file line by line using pyPdf?

Posted on 2024-08-26 02:20:36

I have some code that reads text from a PDF file. Is there a way to read the PDF line by line (rather than page by page) using pyPdf with Python 2.6 on Windows?

Here is the code that reads the PDF pages:

import pyPdf

def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

Update:

The calling code is:

f= open('test.txt','w')
pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
f.write(pdfl)
f.close()

Comments (3)

携君以终年 2024-09-02 02:20:36

It looks like you have a large chunk of text data that you want to process line by line.

You can use the StringIO class to wrap that content as a seekable file-like object:

>>> import StringIO
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = StringIO.StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
>>> buf.readline()
'big\n'

In your case, do:

from StringIO import StringIO

# Read each line of the PDF
pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    doSomething(line.strip())
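
One caveat worth checking: the getPDFContent function in the question finishes with " ".join(... .split()), which turns every newline into a space, so a buffer built from its result would hold a single long line. The sketch below illustrates that effect and one possible way around it. It is an illustration under stated assumptions, not part of the original answer: normalize_keep_lines and iter_lines are hypothetical helper names, the sample string stands in for text extracted from a PDF, and io.StringIO (available from Python 2.6 onward and in Python 3) is used in place of the Python 2-only StringIO module.

```python
import io


def normalize_keep_lines(text):
    """Collapse whitespace inside each line, but keep the line breaks.

    Hypothetical helper: unlike " ".join(text.split()), this does not
    merge all lines into one.
    """
    cleaned = (
        " ".join(line.replace(u"\xa0", " ").split())
        for line in text.splitlines()
    )
    # Drop lines that are empty after cleaning
    return u"\n".join(line for line in cleaned if line)


def iter_lines(text):
    """Wrap text in a file-like buffer and yield it one line at a time."""
    buf = io.StringIO(text)
    for line in buf:
        yield line.rstrip(u"\n")


# Sample data standing in for extracted PDF text (an assumption)
raw = u"big\xa0 ugly\n\n  contents   of\nmultiple pdf files\n"

# What the question's normalization does: all newlines are lost
collapsed = u" ".join(raw.replace(u"\xa0", " ").strip().split())
print(list(iter_lines(collapsed)))

# Line-preserving normalization keeps three separate lines
print(list(iter_lines(normalize_keep_lines(raw))))
```

If per-line processing is the goal, it is probably simpler to split into lines before normalizing whitespace, as above, than to normalize first and try to recover line boundaries afterwards.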
素年丶 2024-09-02 02:20:36

Using yield and PdfFileReader.pages can simplify things:

from pyPdf import PdfFileReader

def get_pdf_content_lines(pdf_file_path):
    # Open in binary mode; text mode corrupts PDF data on Windows
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            for line in page.extractText().splitlines():
                yield line

for line in get_pdf_content_lines('/path/to/file.pdf'):
    print line

In addition, some people may google "python get pdf content text" (which is how I got here), so here is how to get the whole text:

from pyPdf import PdfFileReader

def get_pdf_content(pdf_file_path):
    # Open in binary mode; text mode corrupts PDF data on Windows
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
        content = ' '.join(content.split())
        return content


print get_pdf_content('/path/to/file.pdf')
撑一把青伞 2024-09-02 02:20:36
import pyPdf

def getPDFContent(path):
    content = ""
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    # Use the real page count instead of a hard-coded 10
    num_pages = pdf.getNumPages()
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse all whitespace (including newlines) into single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content