How can I extract tables from PDF documents as structured data?

不喜欢何必死缠烂打 2025-02-20 11:54:32

Extracting tables from PDF documents is extremely hard, because PDF does not contain a semantic layer.

Camelot

You can try camelot, maybe even in combination with its web interface excalibur:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
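
The parsing report shown above can be used to drop weak detections before you trust the data. This is only a minimal sketch, not part of camelot itself: `is_reliable`, `reliable_tables` and both thresholds are hypothetical choices of mine. The `flavor` argument selects camelot's two parsing modes, `'lattice'` for tables drawn with ruling lines and `'stream'` for tables separated by whitespace.

```python
def is_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Judge one parsing_report dict (same keys as in the session above).

    The thresholds are arbitrary illustrative values, not camelot defaults.
    """
    return (report["accuracy"] >= min_accuracy
            and report["whitespace"] <= max_whitespace)


def reliable_tables(path, pages="1", flavor="lattice"):
    """Return the DataFrames of all tables whose report passes is_reliable."""
    import camelot  # imported lazily so is_reliable works without camelot

    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    return [t.df for t in tables if is_reliable(t.parsing_report)]
```

If `'lattice'` finds nothing for a file, retrying with `flavor='stream'` is usually the next thing to try.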

See also: python-camelot

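Tabula

tabula can be installed via

pip install tabula-py

but it requires Java, because tabula-py is only a wrapper around the Java project.

It's used like this: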

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')
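
Because tabula-py depends on Java, a missing runtime is the most common reason the call above fails. The following sketch checks for it first. `java_available` and `read_all_tables` are hypothetical helper names, and looking for `java` on PATH is only a heuristic (a runtime reachable some other way, for example through JAVA_HOME, would be missed).

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def read_all_tables(path):
    """Read every table tabula finds in the PDF into a list of DataFrames."""
    if not java_available():
        raise RuntimeError(
            "tabula-py needs a Java runtime, but `java` was not found on PATH"
        )
    import tabula  # imported lazily, after the Java check

    return tabula.read_pdf(path, pages="all")
```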

AWS Textract

I haven't tried it recently, but AWS Textract claims:

Amazon Textract can extract tables in a document, and extract cells, merged cells, and column headers within a table.
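
Textract returns a flat list of "blocks" rather than a ready-made grid, so the table has to be reassembled from the TABLE, CELL and WORD blocks. Below is a rough sketch of that step, under these assumptions: `tables_from_blocks` and `analyze` are hypothetical helper names, boto3 is installed with AWS credentials and a region configured, the input is something the synchronous `analyze_document` call accepts, and row/column spans of merged cells are ignored.

```python
def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows of cell text."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        # Follow only CHILD relationships (TABLE -> CELL, CELL -> WORD).
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return [by_id[i] for i in ids if i in by_id]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell in children(block):
            if cell["BlockType"] != "CELL":
                continue
            words = [w["Text"] for w in children(cell)
                     if w["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based in the response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def analyze(path):
    """Send one document to Textract and return its tables as nested lists."""
    import boto3  # imported lazily; needs credentials and a region

    client = boto3.client("textract")
    with open(path, "rb") as fh:
        response = client.analyze_document(
            Document={"Bytes": fh.read()}, FeatureTypes=["TABLES"]
        )
    return tables_from_blocks(response["Blocks"])
```

Keeping the block-to-grid logic separate from the API call makes it possible to test it on a saved response without calling AWS.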

pdfplumber

pdfplumber's table extraction methods:

import pdfplumber

pdf = pdfplumber.open("example.pdf")
page = pdf.pages[0]
page.extract_table()
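
`extract_table()` returns the table as a list of rows, each row a list of cell strings (or None for empty cells), and None if no table is found. To get from there to structured records, something like the sketch below should work. `rows_to_records` and `first_page_records` are hypothetical helper names, and treating the first row as the header is an assumption that will not hold for every PDF.

```python
def rows_to_records(table):
    """Turn extract_table() output into dicts keyed by the header row."""
    if not table:  # None (no table found) or an empty list
        return []
    header = [(h or "").strip() for h in table[0]]
    return [
        dict(zip(header, [(cell or "").strip() for cell in row]))
        for row in table[1:]
    ]


def first_page_records(path):
    """Extract the first table on page 1 as a list of dicts."""
    import pdfplumber  # imported lazily so rows_to_records has no dependency

    with pdfplumber.open(path) as pdf:
        table = pdf.pages[0].extract_table()
    return rows_to_records(table)
```

For pages with more than one table, `page.extract_tables()` returns a list of such tables and each one can be passed to `rows_to_records` in turn.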

See also

tabula vs Camelot: stackoverflow.com/q/61387304/562769

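(Trying a copy & paste from Adobe Reader is usually a good first test of whether extraction is feasible at all. Reader's text extraction has been developed over many years and has become quite good. If you cannot extract anything sensible with Acrobat Reader, then text extraction really is a very hard task for that file.)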
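
The same first test can be approximated in code by asking whether a page has any usable text layer at all. This is only a rough sketch under my own assumptions: `has_usable_text`, `page_has_text_layer` and the 20-character threshold are hypothetical, and pdfplumber is used for the extraction simply because it appears above. A scanned PDF without OCR will typically fail this check.

```python
def has_usable_text(text, min_chars=20):
    """Heuristic: enough non-whitespace characters to suggest a text layer."""
    stripped = "".join((text or "").split())
    return len(stripped) >= min_chars


def page_has_text_layer(path, page_index=0):
    """Check one page of a PDF for extractable text."""
    import pdfplumber  # imported lazily so has_usable_text has no dependency

    with pdfplumber.open(path) as pdf:
        return has_usable_text(pdf.pages[page_index].extract_text())
```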

南…巷孤猫 2025-02-20 11:54:32

You can use Tabula:
http://tabula.nerdpower.org
It's free and fairly easy to use.

请你别敷衍 2025-02-20 11:54:32

One option is to use pdf-table-extract.
