Python library Camelot doesn't read all the tables on a page

Posted on 2025-01-18 06:42:07

I'm using the Camelot Python library to read all the tables on one page of a PDF document.

I'm trying to read all the tables on page 10 of this PDF.

I tried to debug by plotting the page, and I noticed that the result changes with the flavor:

This is with flavor lattice

This is with flavor stream

The problem is that with the lattice flavor the tables are not read properly; an example here.

If I use flavor='stream', the data is read properly, but for only one table. The output is something like this.

I tried to use table_area/table_regions to detect the two tables with flavor='stream', but it didn't work.
I paste the code below.

Code with lattice:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='lattice',edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
print(tables[1].df)

Code with stream, without table_area/table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream', edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream, with table_area:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_area=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream, with table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_regions=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

The output is the same with table_regions, with table_area, and without either of them.

Comments (1)

水波映月 2025-01-25 06:42:07

The problem is that you are using table_area instead of the correct parameter table_areas (read the docs).

The following command works perfectly:

tables = camelot.read_pdf(file,pages='10', flavor='stream', edge_tool=1500, table_areas=['10,450,550,50','10,750,550,450'])
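A misspelled keyword such as table_area does not seem to raise an error, which would explain why the output in the question never changed. One way to catch this early is to check keyword names before calling camelot.read_pdf. This is only a minimal sketch: check_kwargs is a hypothetical helper, not part of Camelot, and the list of names is an assumption based on my reading of the Camelot documentation, so it may be incomplete.

```python
import difflib

# Keyword names that camelot.read_pdf and its parsers document, as far as I know.
# Assumption: copied by hand from the docs, so this set may be incomplete.
KNOWN_KWARGS = {
    "pages", "password", "flavor", "suppress_stdout", "layout_kwargs",
    "table_areas", "table_regions", "columns", "split_text", "flag_size",
    "strip_text", "edge_tol", "row_tol", "column_tol",
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def check_kwargs(**kwargs):
    """Map each unrecognized keyword name to its closest known name (or None)."""
    problems = {}
    for name in kwargs:
        if name not in KNOWN_KWARGS:
            # sorted() keeps the candidate order deterministic
            close = difflib.get_close_matches(name, sorted(KNOWN_KWARGS), n=1)
            problems[name] = close[0] if close else None
    return problems


# Hypothetical usage with the keywords from the question
suspicious = check_kwargs(
    pages="10",
    flavor="stream",
    table_area=["10,450,550,50", "10,750,550,450"],
)
for bad, suggestion in suspicious.items():
    print(f"unknown keyword {bad!r}, did you mean {suggestion!r}?")
```

The same check would also flag edge_tool and suggest edge_tol, which is the spelling I know from the documentation; if that is what was intended, it is worth correcting as well.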

Difference between table_areas and table_regions

table_areas should be used when you know the exact position of the table. Conversely, table_regions makes the detection engine look for tables only in those generic page regions.
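To make the coordinate convention concrete: as far as I understand the Camelot documentation, each entry of table_areas and table_regions is a string x1,y1,x2,y2, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space. The origin of that space is the bottom-left of the page, so y1 should be larger than y2. Below is a minimal sketch with a hypothetical helper area_string (not part of Camelot) that builds and sanity-checks such strings. The Camelot calls are left as comments because they need the PDF from the question.

```python
def area_string(x1, y1, x2, y2):
    """Build an 'x1,y1,x2,y2' string: (x1, y1) top-left, (x2, y2) bottom-right.

    PDF coordinates start at the bottom-left of the page, so the top edge
    must have the larger y value.
    """
    if x1 >= x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    if y1 <= y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom)")
    return f"{x1},{y1},{x2},{y2}"


# The two boxes from the question: lower half and upper half of the page
areas = [area_string(10, 450, 550, 50), area_string(10, 750, 550, 450)]
print(areas)

# With Camelot installed and the PDF available, the two options would be:
#
# import camelot
# file = "2022/Auto-trend0122.pdf"
#
# Exact table positions are known:
# tables = camelot.read_pdf(file, pages='10', flavor='stream', table_areas=areas)
#
# Only the rough page regions are known; detection runs inside them:
# tables = camelot.read_pdf(file, pages='10', flavor='stream', table_regions=areas)
```

A quick way to find suitable coordinates is the debug plot already used in the question, reading the box corners off the plotted axes.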
