I am trying to extract a table (including its structure) from a PDF document (example). It's not a scan or an image, so please focus on non-OCR solutions. OCR table extraction is here. Simple text extraction is here.
I tried the route of pdf -> html -> extract table. The PDF I mentioned above produces garbage when converted to HTML, maybe because of the fonts; the document is not in English.
Extracting from the PDF by x and y coordinates is not an option, as the solution needs to work for future PDFs from the URL mentioned above, which will contain the table but not always in the same position.
Comments (4)
Extracting tables from PDF documents is extremely hard as PDF does not contain a semantic layer.
Camelot
You can try camelot, maybe even in combination with its web interface excalibur. See also python-camelot.
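A minimal sketch of how Camelot is typically called, assuming the camelot-py package is installed. The helper names `pick_flavor` and `extract_tables_with_camelot` are made up for illustration; `camelot.read_pdf` and the `lattice`/`stream` flavors come from the library itself. Treat it as a starting point rather than a tested recipe for the document in the question.

```python
def pick_flavor(has_ruling_lines):
    """Choose a Camelot parsing flavor.

    'lattice' relies on lines drawn between cells, 'stream' on the
    whitespace between columns.
    """
    return "lattice" if has_ruling_lines else "stream"


def extract_tables_with_camelot(pdf_path, pages="1", has_ruling_lines=True):
    """Return every table Camelot finds as a pandas DataFrame."""
    # Imported lazily so the helper above works without camelot installed.
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=pick_flavor(has_ruling_lines)
    )
    return [table.df for table in tables]
```

Because the table is located by its lines or by whitespace, no fixed page position has to be supplied. Camelot only works on text-based PDFs, not on scans.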
Tabula
tabula can be installed as the Python package tabula-py, but it requires Java, as tabula-py is only a wrapper for the Java project.
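As a rough sketch of tabula-py usage, assuming the tabula-py package and a Java runtime are available: the helper names `java_available` and `extract_tables_with_tabula` are hypothetical, while `tabula.read_pdf` with its `pages` and `lattice` arguments is the library's own entry point.

```python
import shutil


def java_available():
    """Return True if a `java` executable is on the PATH."""
    return shutil.which("java") is not None


def extract_tables_with_tabula(pdf_path, pages="all", lattice=False):
    """Return the tables tabula finds, as a list of pandas DataFrames."""
    if not java_available():
        raise RuntimeError("tabula-py needs a Java runtime on the PATH")
    # Imported lazily so java_available() works without tabula-py installed.
    import tabula  # third-party: pip install tabula-py

    # lattice=True suits tables with drawn cell borders; the default
    # relies on whitespace between columns instead.
    return tabula.read_pdf(pdf_path, pages=pages, lattice=lattice)
```

Checking for Java up front gives a clearer error than the failure that would otherwise surface when the wrapper tries to start the Java process.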
AWS Textract
I haven't tried it recently, but AWS Textract claims to extract tables from documents.
PdfPlumber
pdfplumber also provides table extraction methods.
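A small sketch of pdfplumber's table extraction, assuming the pdfplumber package is installed. `pdfplumber.open`, `pdf.pages` and `page.extract_table()` are the library's own calls; `extract_first_table` and `rows_to_records` are illustrative helper names, and treating the first row as a header is an assumption about the table.

```python
def rows_to_records(rows):
    """Turn [[header...], [row...], ...] into a list of dicts.

    Assumes the first row is a header; empty header cells get a
    positional name, and missing cells (None) become empty strings.
    """
    if not rows:
        return []
    header, *body = rows
    keys = [(cell or "").strip() or f"col{i}" for i, cell in enumerate(header)]
    return [dict(zip(keys, ((c or "").strip() for c in row))) for row in body]


def extract_first_table(pdf_path, page_index=0):
    """Return the table pdfplumber detects on one page as a list of rows."""
    # Imported lazily so rows_to_records() works without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # extract_table() returns None when no table is detected.
        return page.extract_table() or []
```

For example, `rows_to_records(extract_first_table("example.pdf"))` would give one dict per data row, where `example.pdf` is a placeholder path.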
The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code which is quite a task.
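To illustrate what putting table recognition into code can look like in its very simplest form, here is a sketch that rebuilds a grid from positioned words alone. The input format (a list of `(x, y, text)` tuples with y growing downward), the tolerances, and all function names are assumptions made for this example; real documents usually need much more care (merged cells, wrapped text, ragged columns).

```python
def cluster(values, tol):
    """Group 1-D values; a gap larger than tol starts a new group.

    Returns the mean of each group, in ascending order.
    """
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def words_to_grid(words, row_tol=3.0, col_tol=10.0):
    """Arrange (x, y, text) words into rows and columns by position."""
    row_centers = cluster([y for _, y, _ in words], row_tol)
    col_centers = cluster([x for x, _, _ in words], col_tol)
    grid = [["" for _ in col_centers] for _ in row_centers]
    # Visit words top to bottom, left to right so cell text keeps its order.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        r, c = nearest(row_centers, y), nearest(col_centers, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


words = [(10, 100, "Name"), (120, 100.5, "Qty"),
         (10, 115, "Apple"), (121, 115, "3"),
         (10, 130, "Pear"), (119, 130, "12")]
print(words_to_grid(words))
# [['Name', 'Qty'], ['Apple', '3'], ['Pear', '12']]
```

The grid is derived from how the words line up relative to each other, so it does not depend on where on the page the table sits.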
Generally speaking, if you are sure enough that future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy-to-follow hints that identify the contents of individual fields.
Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.
This is because all fonts in the document claim to use WinAnsiEncoding even though the characters referenced this way are definitely not from the WinAnsi character set.
Thus reliable text extraction from your document without OCR is impossible after all!
(Trying copy & paste from Adobe Reader is generally a good first test of whether text extraction is feasible at all; the Reader's text extraction methods have been developed over many years and have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)
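The same sanity check can be roughly automated: extract some text with any tool and measure how much of it belongs to the script the document is supposed to be written in. The sketch below is only a heuristic; the function names, the 0.8 threshold, and the idea of matching on Unicode character names are assumptions for illustration.

```python
import unicodedata


def script_share(text, script):
    """Fraction of letters whose Unicode name starts with `script`.

    `script` is a Unicode name prefix such as "CYRILLIC" or "DEVANAGARI".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def looks_extractable(text, script, threshold=0.8):
    """True if most letters in the extracted text are in the expected script."""
    return script_share(text, script) >= threshold
```

If a document known to be in, say, Cyrillic yields mostly accented Latin letters, the fonts' encoding information is probably unreliable in the way described above, and no amount of table detection will recover the text without OCR.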
You could use Tabula:
http://tabula.nerdpower.org
It's free and kinda easy to use
One option is to use pdf-table-extract: https://github.com/ashima/pdf-table-extract.