How to extract the background color of table cells in a PDF file using Python?
I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell within a PDF file.
These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, and that tabula-java does not provide colored cell information. Is there an easy-to-follow solution in Python out there?
Thanks in advance.
Comments (3)
Disclaimer: I am the author of the library borb used in this answer.
About PDF: PDF is not so much a "what you see is what you get" format as it is a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draw something we humans interpret as a table. Something like:
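As a rough illustration of what such rendering instructions can look like, the sketch below walks over a tiny content stream and reports which rectangles get filled and with which colour. The stream is a hypothetical, hand-written example (real streams are usually compressed and contain many more operators), and `filled_rectangles` is an illustrative helper, not part of borb or any other library. It only understands the RGB fill operator `rg`, the rectangle operator `re`, and the path-painting operators.

```python
# Hypothetical, hand-written content stream (not taken from a real file):
# "rg" sets the fill colour, "re" adds a rectangle to the path,
# "f" fills the path, "RG" sets the stroke colour, "S" strokes the path.
CONTENT_STREAM = """
1 1 0 rg
50 700 100 20 re
f
0 0 0 RG
50 700 100 20 re
S
0.9 0.9 0.9 rg
150 700 100 20 re
f
"""

FILL_OPERATORS = {"f", "F", "f*", "B", "B*", "b", "b*"}
DISCARD_OPERATORS = {"S", "s", "n"}  # stroke only, or end path without painting


def filled_rectangles(stream):
    """Return (fill_rgb, (x, y, width, height)) for every filled rectangle."""
    fill = (0.0, 0.0, 0.0)  # the default fill colour in PDF is black
    pending = []            # rectangles added to the current path
    operands = []           # numbers seen since the last operator
    result = []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "rg":
            fill = tuple(operands[-3:])
        elif token == "re":
            pending.append(tuple(operands[-4:]))
        elif token in FILL_OPERATORS:
            result.extend((fill, rect) for rect in pending)
            pending = []
        elif token in DISCARD_OPERATORS:
            pending = []
        operands = []
    return result


for colour, rect in filled_rectangles(CONTENT_STREAM):
    print(colour, rect)
```

Here the first and third rectangles are filled (yellow and light grey), while the second is only stroked, which is how a cell border would typically be drawn. Nothing in the stream says "table" or "cell"; that structure has to be inferred.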
Whenever a PDF library is extracting tables from a PDF, it's important to keep in mind this is a heuristic. It's based on some assumptions. Such as "tables tend to have a large number of lines that intersect at 90-degree angles".
I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document. You would use it as such:
As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.
For this, I would start at this particular line.
A kind user reported that my previous solution did not work well.
That is true, because pdfplumber's page.rects does not always detect the cells in a table correctly. Sometimes it only detects lines, rows or columns.
So I propose another solution here.
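One way to work around that limitation, shown here as a minimal sketch and not necessarily the approach this answer originally used, is to take the cell boxes from pdfplumber's table finder and look up which filled rectangle covers the centre of each cell. That way a rectangle spanning a whole row or column still colours every cell inside it. The helper names `cell_fill_colours` and `table_cell_colours` are made up for illustration, and the code assumes that `page.rects` lists rectangles in drawing order, so that a later rectangle paints over an earlier one.

```python
def cell_fill_colours(cells, rects):
    """Map each cell bbox (x0, top, x1, bottom) to the fill colour of the
    last filled rectangle that contains the cell's centre, or None."""
    colours = {}
    for cell in cells:
        if cell is None:
            continue
        x0, top, x1, bottom = cell
        cx, cy = (x0 + x1) / 2, (top + bottom) / 2
        colour = None
        for rect in rects:
            if not rect.get("fill"):
                continue  # stroked-only rectangles are borders, not backgrounds
            inside_x = rect["x0"] <= cx <= rect["x1"]
            inside_y = rect["top"] <= cy <= rect["bottom"]
            if inside_x and inside_y:
                # Assumption: later rectangles are drawn on top of earlier ones.
                colour = rect.get("non_stroking_color")
        colours[tuple(cell)] = colour
    return colours


def table_cell_colours(pdf_path, page_number=0):
    """Return one {cell_bbox: colour} dict per table found on the page."""
    import pdfplumber  # third-party; imported lazily so the helper above stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        return [cell_fill_colours(table.cells, page.rects)
                for table in page.find_tables()]
```

The format of `non_stroking_color` depends on the colour space used in the file (one component for grey, three for RGB, four for CMYK), so the values may need normalising before they are compared.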
I found a solution using pdfplumber.
Here is some rough sample code.
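A minimal sketch of that idea, not the answerer's exact code: pdfplumber exposes the rectangles drawn on a page as `page.rects`, a list of dicts that include the bounding box, a `fill` flag and the fill colour under `non_stroking_color`. The helpers `to_rgb` and `highlighted_rects` below are hypothetical names introduced for illustration. They normalise the colour to RGB and keep only rectangles whose fill differs from an assumed white page background.

```python
def to_rgb(colour):
    """Convert a grey, RGB or CMYK colour (components in 0..1) to an RGB tuple.
    Returns None for anything that cannot be interpreted."""
    if colour is None:
        return None
    if isinstance(colour, (int, float)):
        colour = (colour,)
    try:
        colour = tuple(float(c) for c in colour)
    except (TypeError, ValueError):
        return None  # e.g. a pattern name instead of numeric components
    if len(colour) == 1:
        return (colour[0],) * 3
    if len(colour) == 3:
        return colour
    if len(colour) == 4:
        c, m, y, k = colour
        # Simple, uncalibrated CMYK to RGB conversion.
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    return None


def highlighted_rects(rects, background=(1.0, 1.0, 1.0)):
    """Return ((x0, top, x1, bottom), rgb) for filled rects that are not background-coloured."""
    found = []
    for rect in rects:
        if not rect.get("fill"):
            continue
        rgb = to_rgb(rect.get("non_stroking_color"))
        if rgb is not None and rgb != background:
            bbox = (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
            found.append((bbox, rgb))
    return found


def highlighted_rects_in_pdf(pdf_path, page_number=0):
    """Open a PDF with pdfplumber and list the coloured rectangles on one page."""
    import pdfplumber  # third-party; imported lazily so the helpers above stay dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        return highlighted_rects(pdf.pages[page_number].rects)
```

Each returned bounding box can then be compared with the table's cell positions to decide which cells are highlighted. This only works when the PDF actually draws one filled rectangle per cell.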