pypdf Python tool

Posted on 2024-09-26 03:45:00

Using the pypdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf

# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import pyPdf

def getPDFContent(path):
   content = ""
   # Load PDF into pyPDF
   pdf = pyPdf.PdfFileReader(file(path, "rb"))
   # Iterate pages
   for i in range(0, pdf.getNumPages()):
      # Extract text from page and add to content
      content += pdf.getPage(i).extractText() + "\n"
   # Collapse whitespace
   content = " ".join(content.replace(u"\xa0", " ").strip().split())
   return content

print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

The code above prints only binary data.

Also, how can I print the document's contents using the code below?

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))

# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)

Comments (2)

天冷不及心凉 2024-10-03 03:45:01

If you want to write the text from a PDF file to a text file, you can use extractText(), as below:

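import os
from PyPDF2 import PdfFileReader  # legacy camelCase API (PyPDF2 before 3.0)

# path_to_folder is presumably a local module that defines main_path,
# the folder holding the input PDF and the output text file.
from path_to_folder import main_path as my_text

my_pdf_path = os.path.join(my_text, "my_pdf.pdf")
out_path = os.path.join(my_text, "out_put.txt")

# Python 3: write UTF-8 so non-ASCII text (such as Hindi) can be saved.
with open(my_pdf_path, "rb") as pdf_file, open(out_path, "w", encoding="utf-8") as out_text:
    pdf_read = PdfFileReader(pdf_file)
    # The document info or its title may be missing, so guard against None.
    info = pdf_read.getDocumentInfo()
    if info is not None and info.title:
        out_text.write(info.title + "\n")
    # Append the extracted text of every page.
    for page_number in range(pdf_read.getNumPages()):
        out_text.write(pdf_read.getPage(page_number).extractText())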

In the example above, I simply extract the text from each page and write it to a text file; you can write out whatever you choose.
If you need to save specific pages as a new PDF, you can use the code below:

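import os
from pyPdf import PdfFileWriter, PdfFileReader

main_path = "/home/tom/Desktop/"
output_file = PdfFileWriter()
# Keep the source file open until the output has been written.
input_stream = open(os.path.join(main_path, "Hindi_Book.pdf"), "rb")
input_file = PdfFileReader(input_stream)
# This copies every page; narrow the range to keep only specific pages.
for page_number in range(input_file.getNumPages()):
    output_file.addPage(input_file.getPage(page_number))

new_file = os.path.join(main_path, "Out_folder/new_pdf.pdf")
out_fil1 = open(new_file, "wb")
output_file.write(out_fil1)
# Close the file objects rather than the writer.
out_fil1.close()
input_stream.close()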

The link you provided doesn't work, so I couldn't look at the file, sorry.
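
For anyone on a current install: the maintained successor of pyPdf and PyPDF2 is published as pypdf and uses snake_case names (PdfReader, reader.pages, extract_text(), add_page()). Below is a minimal sketch of the two snippets in this answer written against that API, assuming pypdf 3 or later is installed. The function names and the paths in the usage comment are illustrative and not part of the original answer.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze runs of whitespace,
    mirroring the clean-up step in the question's getPDFContent()."""
    return " ".join(text.replace("\xa0", " ").split())


def extract_text(path):
    """Return the text of every page in the PDF at `path` as one string."""
    # Assumption: the third-party package pypdf (version 3 or later) is installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield an empty string for image-only pages.
    parts = [(page.extract_text() or "") for page in reader.pages]
    return collapse_whitespace("\n".join(parts))


def copy_pages(src_path, dst_path, page_indices):
    """Write the pages at the given 0-based indices into a new PDF."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indices:
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)


# Hypothetical usage (paths are placeholders):
# print(extract_text("/home/tom/Desktop/Hindi_Book.pdf"))
# copy_pages("/home/tom/Desktop/Hindi_Book.pdf", "/home/tom/Desktop/first_two.pdf", [0, 1])
```

In Python 3, print() can emit Unicode text directly, so the .encode("ascii", "xmlcharrefreplace") step from the question should not be needed on a UTF-8 terminal.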

冰葑 2024-10-03 03:45:00

Note that most of the "text" of the pdf document you refer to isn't real text at all: it's mostly images. The actual text seems to get extracted correctly when I try it (although I must admit that apart from some snippets on the front page and the page numbers, I can't read it ;-)).

As for the second question: I'm not sure what you're asking there.
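
To see how much of a document is image-only, one option is to check which pages give back no text at all. This is a rough sketch that works on the list of per-page strings you already get from extractText(); the helper names and the sample data are made up for illustration.

```python
def pages_without_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is
    empty, whitespace-only, or None."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def summarize(page_texts):
    """Build a one-line report of how many pages yielded no text."""
    empty = pages_without_text(page_texts)
    return "%d of %d pages yielded no text: %s" % (len(empty), len(page_texts), empty)


# Hypothetical extraction results, one string per page.
sample = ["Title page", "", "   \n", "12"]
print(summarize(sample))  # prints: 2 of 4 pages yielded no text: [2, 3]
```

Pages reported here are likely scanned images, in which case no text-extraction call will help and OCR would be needed instead.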
