How to extract a table as structured data from a PDF document?

Posted on 2025-02-13 11:54:32

I am trying to extract a table (including its structure) from a PDF document (example). It's not a scan or an image, so please focus on non-OCR solutions. OCR table extraction is here. Simple text extraction is here.

I tried the route PDF -> HTML -> extract table. When converted to HTML, the PDF mentioned above produces garbage, perhaps because of the font; the document is not in English.

Extracting from the PDF using fixed x and y coordinates is not an option, as the solution needs to work for future PDFs from the URL mentioned above, which will contain the table but not always in the same position.

Comments (4)

不喜欢何必死缠烂打 2025-02-20 11:54:32

Extracting tables from PDF documents is extremely hard as PDF does not contain a semantic layer.

Camelot

You can try camelot, maybe even in combination with its web interface excalibur:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
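
The parsing report can serve as a rough quality gate when a PDF yields several candidate tables. The sketch below keeps only tables whose report passes two thresholds. The helper name `select_reliable_tables` and the cut-off values are illustrative assumptions, not part of camelot; the code relies only on each table exposing a `parsing_report` dict with the `accuracy` and `whitespace` keys shown above.

```python
from types import SimpleNamespace


def select_reliable_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return tables whose parsing report passes both (assumed) thresholds."""
    reliable = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            reliable.append(table)
    return reliable


# Stand-in objects so the sketch runs without a PDF; with camelot you would
# pass the TableList returned by camelot.read_pdf(...) instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.50, "whitespace": 80.00, "order": 2, "page": 1}),
]
good = select_reliable_tables(fake_tables)
print([t.parsing_report["order"] for t in good])
```

Sensible thresholds depend on the documents involved, so it is probably worth inspecting a few reports by hand before settling on values.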

See also: python-camelot

Tabula

tabula can be installed via

pip install tabula-py

But it requires Java, as tabula-py is only a wrapper for the Java project.

It's used like this:

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')
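
Because tabula-py only wraps a Java project, a quick preflight check can give a clearer error than a failed call. This is a minimal sketch that merely looks for a `java` executable on the PATH; the helper name `java_available` is made up for illustration, and tabula-py itself may locate Java differently.

```python
import shutil


def java_available(executable="java"):
    """Return True if a Java launcher can be found on the PATH."""
    return shutil.which(executable) is not None


if java_available():
    print("Java found; tabula-py should be able to start its backend.")
else:
    print("Java not found on PATH; install a Java runtime before calling tabula.read_pdf().")
```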

AWS Textract

I haven't tried it recently, but AWS Textract claims:

Amazon Textract can extract tables in a document, and extract cells, merged cells, and column headers within a table.

PdfPlumber

pdfplumber table extraction methods:

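import pdfplumber

# Open the PDF and extract the largest table on the first page.
# The result is a list of rows (each a list of cell strings), or None if no table is found.
with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()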
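
`extract_table()` returns the table as a list of rows, each a list of cell strings (a cell may be `None`). To get from there to structured data, one option is to treat the first row as the header. The sketch below does that with the standard library only; the helper names and the sample rows are invented for illustration, and it assumes the header row is unique and present, which real PDFs do not guarantee.

```python
import csv
import io


def rows_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records):
    """Serialize the records to CSV text (empty string if there are none)."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hand-written sample in the shape extract_table() returns; not real output.
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
records = rows_to_records(sample)
print(records_to_csv(records))
```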

小草泠泠 2025-02-20 11:54:32

The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus, your task amounts to putting our human table-recognition capabilities into code, which is quite a task.
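
To give a feel for what that means in its simplest form, here is a rough sketch that rebuilds a grid from positioned text chunks by clustering their coordinates. It uses relative positions only, so it does not depend on where the table sits on the page. Everything in it is an assumption for illustration: the `(x, y, text)` input shape, the tolerance values, and the premise that each cell is a single left-aligned chunk. Real tables with multi-word cells, right-aligned numbers, or spanning cells need considerably more logic.

```python
def group_into_rows(words, y_tolerance=3.0):
    """Group (x, y, text) chunks into rows of roughly equal y."""
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Compare with the most recently placed chunk of the current row.
        if rows and abs(rows[-1][-1][1] - y) <= y_tolerance:
            rows[-1].append((x, y, text))
        else:
            rows.append([(x, y, text)])
    return [sorted(row) for row in rows]


def column_starts(rows, x_tolerance=5.0):
    """Cluster the x positions of all chunks into column start positions."""
    starts = []
    for x in sorted(x for row in rows for x, _, _ in row):
        if not starts or x - starts[-1] > x_tolerance:
            starts.append(x)
    return starts


def to_grid(words, y_tolerance=3.0, x_tolerance=5.0):
    """Place every chunk into the cell of its row and nearest column start."""
    rows = group_into_rows(words, y_tolerance)
    starts = column_starts(rows, x_tolerance)
    grid = []
    for row in rows:
        cells = [""] * len(starts)
        for x, _, text in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        grid.append(cells)
    return grid


# Invented coordinates; note the slight jitter and the missing cell in the last row.
words = [
    (50, 100, "Name"), (200, 101, "Qty"),
    (50, 120, "Widget"), (201, 120, "3"),
    (50, 140, "Gadget"),
]
for line in to_grid(words):
    print(line)
```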

Generally speaking, if you are sufficiently sure that future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for easy-to-follow hints that identify the contents of individual fields.

Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.

This is because all fonts in the document claim to use WinAnsiEncoding, even though the characters referenced this way are definitely not from the WinAnsi character selection.

Thus reliable text extraction from your document without OCR is impossible after all!

(Trying copy&paste from Adobe Reader generally is a good first test whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many many years and, therefore, have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)
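
The copy&paste test can be approximated in code once some text has been extracted. The heuristic below is only a sketch under a strong assumption: the document is known to be written in a non-Latin script, so extracted text consisting almost entirely of Latin letters suggests the fonts' encodings are misleading. The function names, the threshold, and the sample strings are all invented for illustration.

```python
import unicodedata


def latin_letter_ratio(text):
    """Fraction of alphabetic characters whose Unicode name starts with 'LATIN'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def looks_mis_encoded(text, threshold=0.9):
    """Heuristic: a document expected to be in a non-Latin script should not
    extract as almost entirely Latin letters."""
    return latin_letter_ratio(text) >= threshold


# Made-up samples: accented Latin letters where another script was expected,
# followed by genuine non-Latin text.
print(looks_mis_encoded("Ìàð÷ Äåêàáðü"))
print(looks_mis_encoded("Март Декабрь"))
```

A positive result does not prove the text is unusable, and the check says nothing about documents that are legitimately written in a Latin script.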

南…巷孤猫 2025-02-20 11:54:32

You could use Tabula:
http://tabula.nerdpower.org
It's free and kinda easy to use

请你别敷衍 2025-02-20 11:54:32

One option is to use pdf-table-extract: https://github.com/ashima/pdf-table-extract.
