pypdf Python tool
How can I read the following PDF file using the pyPdf Python module? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
    # -*- coding: utf-8 -*-
    import pyPdf

    def getPDFContent(path):
        content = ""
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate over the pages
        for i in range(0, pdf.getNumPages()):
            # Extract text from the page and add it to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

    print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
The code above prints only binary data.
Also, how do I print the contents using the code below?
    from pyPdf import PdfFileWriter, PdfFileReader

    output = PdfFileWriter()
    input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))
    # Print the title of Hindi_Book.pdf from the document metadata
    print "title = %s" % (input1.getDocumentInfo().title)
Comments (2)
If you want to write out specific text from the PDF file, you can use extractText(). For instance, you can extract the text from each page and write it to a text file, or pick out only the parts you need.
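A minimal sketch of that page-by-page approach follows. It is not the answer's original code: it assumes the current pypdf package, where the reader class is PdfReader and the method is spelled extract_text() rather than pyPdf's extractText(). The names collect_page_text and pdf_to_text_file are made up for illustration.

```python
def collect_page_text(pages):
    """Return the text of every page, one line per page."""
    chunks = []
    for page in pages:
        # extract_text() can yield nothing for image-only pages.
        text = page.extract_text() or ""
        # split() with no arguments collapses all whitespace runs,
        # including non-breaking spaces, in Python 3.
        chunks.append(" ".join(text.split()))
    return "\n".join(chunks)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract all text from pdf_path and write it to txt_path as UTF-8."""
    # Assumption: the modern pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(collect_page_text(reader.pages))
```

Writing the file as UTF-8 keeps non-ASCII characters such as Devanagari intact, instead of converting them to character references the way encode("ascii", "xmlcharrefreplace") does.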
If you need specific pages as a separate PDF, you can copy them into a new file with PdfFileWriter.
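Copying selected pages into a new PDF could look roughly like this. Again this is only a sketch under the assumption that the modern pypdf package is used (PdfReader, PdfWriter, add_page), not the answer's lost code; select_pages and save_pages_as_pdf are hypothetical helper names.

```python
def select_pages(pages, wanted):
    """Return the pages at the zero-based indexes in `wanted`, in that order."""
    pages = list(pages)
    return [pages[i] for i in wanted]


def save_pages_as_pdf(src_path, dst_path, wanted):
    """Copy the pages at indexes `wanted` from src_path into a new PDF."""
    # Assumption: the modern pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in select_pages(reader.pages, wanted):
        writer.add_page(page)
    # PDF output is binary, so the file must be opened in "wb" mode.
    with open(dst_path, "wb") as out:
        writer.write(out)
```

For example, save_pages_as_pdf("Hindi_Book.pdf", "first_two_pages.pdf", [0, 1]) would write the first two pages to a new file; both file names here are placeholders.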
The link you provided doesn't work, so I couldn't look at the file, sorry.
Note that most of the "text" in the PDF document you refer to isn't real text at all: it's mostly images. The actual text seems to be extracted correctly when I try it (although I must admit that, apart from some snippets on the front page and the page numbers, I can't read it ;-)).
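One rough way to see which pages carry no extractable text is sketched below. It assumes the modern pypdf package rather than pyPdf, and it is only a heuristic: an empty result suggests a scanned or image-only page but does not prove it. The function names are invented for this example.

```python
def pages_without_text(pages):
    """Return zero-based indexes of pages whose extracted text is empty."""
    empty = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""
        # Whitespace-only output counts as "no text".
        if not text.strip():
            empty.append(index)
    return empty


def report_image_only_pages(pdf_path):
    """Print the 1-based numbers of pages that yield no text."""
    # Assumption: the modern pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for index in pages_without_text(reader.pages):
        print("page %d has no extractable text" % (index + 1))
```

Text on such pages can only be recovered with OCR, which is outside what a PDF text extractor does.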
As for the second question: I'm not sure what you're asking there.