We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 9 years ago.
The PDFMiner package has changed since codeape posted.
EDIT (again):
PDFMiner has been updated again in version 20100213. You can check the version you have installed with the following:
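A minimal sketch of a version check, assuming Python 3.8+ and the standard-library importlib.metadata. The helper name installed_version and the distribution names "pdfminer" and "pdfminer.six" are assumptions for illustration; the package also defines pdfminer.__version__, which you can print directly after importing it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to whatever you installed.
    for name in ("pdfminer", "pdfminer.six"):
        print(name, installed_version(name))
```

A return value of None just means that distribution name is not installed, which is worth knowing before debugging import errors.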
Here's the updated version (with comments on what I changed/added):
Edit (yet again):
Here is an update for the latest version on PyPI, 20100619p1. In short, I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.
EDIT (one more time):
Updated for version 20110515 (thanks to Oeufcoque Penteano!):
Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format.
The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
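Stripping the tags can be sketched with the standard library alone. This is a rough approach, not PDFMiner's own: the sample markup below is made up for illustration, and a regex like this will misbehave if an attribute value contains a ">" character, so a real XML parser may be the safer choice for messy input.

```python
import html
import re

# Matches anything that looks like an opening or closing tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Drop anything that looks like an XML/HTML tag and unescape entities."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


# Hypothetical tagged output, only here to exercise the function.
sample = '<P MCID="0">Hello &amp; welcome</P><Span MCID="1"> to PDF text</Span>'
print(strip_tags(sample))
```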
A Python 3 version is available under:
PDFMiner gave me perhaps one line [page 1 of 7...] on every page of a PDF file I tried it with.
The best answer I have so far is pdftoipe, or the C++ code it's based on, Xpdf.
See my question for what the output of pdftoipe looks like.
Additionally there is PDFTextStream which is a commercial Java library that can also be used from Python.
I have used pdftohtml with the -xml argument and read the result with subprocess.Popen(). That will give you the x coord, y coord, width, height, and font of every snippet of text in the PDF. I think this is what 'evince' probably uses too, because the same error messages spew out.
If you need to process columnar data, it gets slightly more complicated, as you have to invent an algorithm that suits your PDF file. The problem is that the programs that make PDF files don't necessarily lay out the text in any logical order. You can try simple sorting algorithms and they work sometimes, but there can be little 'stragglers' and 'strays', pieces of text that don't get put in the order you thought they would. So you have to get creative.
It took me about 5 hours to figure out one for the PDFs I was working on. But it works pretty well now. Good luck.
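A rough sketch of that approach, under some assumptions: it uses subprocess.run rather than Popen, assumes poppler's pdftohtml is on the PATH and accepts -stdout and -i (ignore images), and assumes the output uses <page> and <text> elements with top/left/width/height/font attributes. The helper names and the sample XML are made up for illustration, and the row grouping is only the "simple sorting" starting point, not a finished columnar algorithm.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftohtml_xml(pdf_path):
    """Run pdftohtml in XML mode and return its output as bytes."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        capture_output=True, check=True,
    )
    return result.stdout


def parse_snippets(xml_data):
    """Return one dict per <text> element, with page, geometry, font and text."""
    root = ET.fromstring(xml_data)
    snippets = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            snippets.append({
                "page": int(page.get("number")),
                "left": float(el.get("left")),
                "top": float(el.get("top")),
                "width": float(el.get("width")),
                "height": float(el.get("height")),
                "font": el.get("font"),
                # itertext() also picks up text nested in <b>, <i>, <a>.
                "text": "".join(el.itertext()),
            })
    return snippets


def group_into_rows(snippets, tolerance=2):
    """Sort by page, top, left; start a new row when `top` moves past tolerance."""
    ordered = sorted(snippets, key=lambda s: (s["page"], s["top"], s["left"]))
    rows = []
    for s in ordered:
        last = rows[-1] if rows else None
        same_row = (last is not None
                    and last[0]["page"] == s["page"]
                    and abs(last[0]["top"] - s["top"]) <= tolerance)
        if same_row:
            last.append(s)
        else:
            rows.append([s])
    # Within a row, read left to right.
    return [[s["text"] for s in sorted(row, key=lambda s: s["left"])]
            for row in rows]


# Hypothetical output, deliberately out of order, with one 1-pixel "stray".
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="300" width="40" height="14" font="0">Qty</text>
<text top="100" left="50" width="60" height="14" font="0"><b>Item</b></text>
<text top="121" left="300" width="10" height="14" font="0">3</text>
<text top="120" left="50" width="70" height="14" font="0">Widget</text>
</page>
</pdf2xml>"""

print(group_into_rows(parse_snippets(SAMPLE)))
```

The tolerance parameter is where the "stragglers" get absorbed; the right value depends entirely on the PDFs you are working with.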
Found that solution today. Works great for me. Even rendering PDF pages to PNG images.
http://www.swftools.org/gfx_tutorial.html
Repurposing the pdf2txt.py code that comes with pdfminer, you can make a function that takes a path to the PDF and, optionally, an outtype (txt|html|xml|tag) and opts like those of the command-line pdf2txt, e.g. {'-o': '/path/to/outfile.txt' ...}. By default, you can call:
A text file will be created, a sibling on the filesystem to the original PDF.
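A sketch of such a wrapper, not the answer's own function: it assumes the maintained pdfminer.six fork and its pdfminer.high_level.extract_text_to_fp helper instead of repurposing pdf2txt.py directly. The names sibling_output_path, convert_pdf and OUTPUT_TYPES are hypothetical, and only an output path is supported rather than the full set of command-line opts.

```python
from pathlib import Path

# The answer's outtype names mapped to the output_type values that
# pdfminer.six's extract_text_to_fp accepts (assumed fork, see above).
OUTPUT_TYPES = {"txt": "text", "html": "html", "xml": "xml", "tag": "tag"}


def sibling_output_path(pdf_path, outtype="txt"):
    """Return the path next to the PDF with its extension swapped for outtype."""
    if outtype not in OUTPUT_TYPES:
        raise ValueError(f"unsupported outtype: {outtype!r}")
    return Path(pdf_path).with_suffix("." + outtype)


def convert_pdf(pdf_path, outtype="txt", outfile=None):
    """Write the PDF's text to outfile (default: a sibling of the PDF)."""
    # Imported lazily so the path helper above works without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    target = Path(outfile) if outfile else sibling_output_path(pdf_path, outtype)
    with open(pdf_path, "rb") as src, open(target, "wb") as dst:
        extract_text_to_fp(src, dst, output_type=OUTPUT_TYPES[outtype],
                           laparams=LAParams())
    return target
```

Calling convert_pdf("/path/to/file.pdf") with the defaults would then write /path/to/file.txt next to the original.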
Since none of these solutions support the latest version of PDFMiner, I wrote a simple solution that returns the text of a PDF using PDFMiner. This will work for those who are getting import errors with process_pdf.
See the code below, which works for Python 3:
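One way this can look without process_pdf, as a sketch: it assumes the pdfminer.six fork and drives a PDFPageInterpreter page by page. The function name pdf_to_text is chosen here for illustration and is not necessarily the answer's.

```python
from io import StringIO


def pdf_to_text(pdf_path):
    """Return the text of every page in the PDF as one string."""
    # Third-party imports are kept inside the function so this module can be
    # imported even where pdfminer.six is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = StringIO()
    # TextConverter writes the extracted text of each page into the buffer.
    device = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        with open(pdf_path, "rb") as fp:
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
        return buffer.getvalue()
    finally:
        device.close()
        buffer.close()
```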
pdftotext: an open-source program (part of Xpdf) which you could call from Python (not what you asked for, but it might be useful). I've used it with no problems. I think Google uses it in Google Desktop.
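Calling it from Python could look roughly like this. It is a sketch that assumes a pdftotext binary is on the PATH, relies on "-" as the output file meaning standard output, and uses the -layout option; the helper names are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the argument list; '-' as the output file means stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    cmd += [pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(pdftotext_command(pdf_path, layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```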
pyPdf works fine (assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can just do:
You can also easily get access to the metadata, image data, and so forth.
A comment in the extractText code notes:
Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, or if the generator adds text to the stream in the order it will be displayed, it's fine). I have pyPdf extraction code in daily use, without any problems.
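For the text-only case, a minimal sketch with the maintained successor package pypdf, which is an assumption: the answer itself uses pyPdf, where the corresponding names were PdfFileReader and extractText. The function name extract_pages is hypothetical.

```python
def extract_pages(pdf_path):
    """Return a list with the extracted text of each page, in page order."""
    # Third-party import kept local so the module imports without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for pages with no text.
    return [page.extract_text() or "" for page in reader.pages]
```

The same caveat about text order applies, since the extraction still follows the order of the content stream.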
You can also quite easily use pdfminer as a library. You have access to the pdf's content model, and can create your own text extraction. I did this to convert pdf contents to semi-colon separated text, using the code below.
The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ';' characters.
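The sorting idea on its own can be sketched without pdfminer, using plain (x, y, text) tuples in place of the TextItem objects. The function name and sample data are made up, and the sketch assumes PDF-style coordinates where larger y is nearer the top of the page.

```python
from itertools import groupby


def rows_to_lines(items, sep=";"):
    """items: iterable of (x, y, text). Returns one joined string per distinct y."""
    # Sort top-to-bottom (y descending, by assumption), then left-to-right.
    ordered = sorted(items, key=lambda it: (-it[1], it[0]))
    lines = []
    # Items sharing exactly the same y coordinate form one output line.
    for _, row in groupby(ordered, key=lambda it: it[1]):
        lines.append(sep.join(text for _, _, text in row))
    return lines


# Hypothetical items, deliberately out of order.
items = [(200, 700, "Qty"), (50, 700, "Item"), (50, 680, "Widget"), (200, 680, "3")]
print("\n".join(rows_to_lines(items)))
```

Grouping on exact y equality is the simplest choice; text that sits a fraction of a point off the baseline would need a tolerance instead.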
Using this approach, I was able to extract text from a PDF from which no other tool could extract content suitable for further parsing. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.
pdfminer is an invaluable tool for pdf-scraping.
UPDATE:
The code above is written against an old version of the API; see my comment below.
slate is a project that makes it very simple to use PDFMiner from a library:
I needed to convert a specific PDF to plain text within a Python module. I used PDFMiner 20110515; after reading through their pdf2txt.py tool, I wrote this simple snippet: