How can I extract the background color of table cells in a PDF file using Python?

Posted on 2025-01-30 02:54:53

I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell within a PDF file.

These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, which does not provide colored-cell information. Is there an easy-to-follow solution in Python out there?

Thanks in advance.

Comments (3)

长伴 2025-02-06 02:54:53

Disclaimer: I am the author of borb, the library used in this answer.

About PDF: PDF is not so much a "what you see is what you get" format as it is a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draw something we humans interpret as a table. Something like:

  • go to location x, y
  • set the current stroke colour to black
  • set the current fill colour to blue
  • set the font to Helvetica, size 12
  • draw a line to x, y
  • move the pen up
  • go to x, y
  • render the string "Hello World"
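To make the list above concrete, here is a minimal sketch of how a reader of such instructions could remember which fill colour was active when a rectangle was painted. The instruction names (`set_fill`, `rect`, `fill`, `stroke`) are made up for illustration and are not real PDF operators or part of any library; the only point is that the colour belongs to the graphics state, not to the "cell".

```python
from dataclasses import dataclass


@dataclass
class GraphicsState:
    """Toy graphics state: only the two colours matter for this sketch."""
    stroke: tuple = (0.0, 0.0, 0.0)
    fill: tuple = (0.0, 0.0, 0.0)


def collect_filled_rects(instructions):
    """Replay toy instructions and pair each filled rectangle with the
    fill colour that was active when it was painted."""
    state = GraphicsState()
    pending = []  # rectangles added to the current path, not yet painted
    filled = []
    for op, *args in instructions:
        if op == "set_stroke":
            state.stroke = tuple(args)
        elif op == "set_fill":
            state.fill = tuple(args)
        elif op == "rect":
            pending.append(tuple(args))
        elif op == "fill":
            filled.extend((rect, state.fill) for rect in pending)
            pending.clear()
        elif op == "stroke":
            # Outline only: nothing is filled, the path is simply consumed.
            pending.clear()
    return filled


# Hypothetical instruction stream: one blue cell, then one white cell.
instructions = [
    ("set_stroke", 0.0, 0.0, 0.0),
    ("set_fill", 0.0, 0.0, 1.0),
    ("rect", 10, 10, 100, 20),
    ("fill",),
    ("set_fill", 1.0, 1.0, 1.0),
    ("rect", 10, 30, 100, 20),
    ("fill",),
]
filled_rects = collect_filled_rects(instructions)
```

Each entry of `filled_rects` is a `(rectangle, fill colour)` pair, which is the kind of information a table detector would have to keep if it wanted to report cell backgrounds.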

Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on some assumptions, such as "tables tend to have a large number of lines that intersect at 90-degree angles".
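As a rough illustration of that kind of assumption (this is not borb's actual algorithm, and the function name and tolerance are made up for this sketch), one could count how often horizontal and vertical line segments cross. A page region with many such crossings is a plausible table candidate.

```python
def count_right_angle_crossings(segments, tol=0.5):
    """Count crossings between axis-aligned horizontal and vertical segments.

    Each segment is (x0, y0, x1, y1). Segments that are neither horizontal
    nor vertical (within `tol`) are ignored.
    """
    horizontals, verticals = [], []
    for x0, y0, x1, y1 in segments:
        if abs(y0 - y1) <= tol:
            horizontals.append((min(x0, x1), max(x0, x1), (y0 + y1) / 2))
        elif abs(x0 - x1) <= tol:
            verticals.append((min(y0, y1), max(y0, y1), (x0 + x1) / 2))

    crossings = 0
    for hx0, hx1, hy in horizontals:
        for vy0, vy1, vx in verticals:
            # The vertical line's x must lie within the horizontal span,
            # and the horizontal line's y within the vertical span.
            if hx0 - tol <= vx <= hx1 + tol and vy0 - tol <= hy <= vy1 + tol:
                crossings += 1
    return crossings


# A 2x2 grid: three horizontal and three vertical rulings, plus one diagonal.
grid = [(0, y, 20, y) for y in (0, 10, 20)] + [(x, 0, x, 20) for x in (0, 10, 20)]
grid.append((0, 0, 5, 5))
crossing_count = count_right_angle_crossings(grid)
```

Because this is only a heuristic, a table drawn without ruling lines would produce no crossings at all and would be missed.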

I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.

You would use it as such:

import typing
from pathlib import Path

from borb.pdf.canvas.layout.table.table import Table
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

# Path to the PDF to analyse (placeholder).
input_file: Path = Path("/path/to/target.pdf")

doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    # The listener receives the rendering instructions while the PDF is parsed.
    detector: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [detector])

assert doc is not None
# Tables detected on the first page (page index 0).
tables: typing.List[Table] = detector.get_tables_for_page(0)

As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.

For this, I would start at this particular line.

哎呦我呸! 2025-02-06 02:54:53

A user kindly reported that my previous solution did not work well.
That is true, because pdfplumber's page.rects does not always detect the cells of a table correctly.
Sometimes it only detects lines, rows, or columns.
So I propose another solution here, which samples the rendered page image instead.

from collections import Counter

import pdfplumber


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most common RGB colour inside the cell's bounding box."""
    # By default the page image is rendered at 72 DPI, so PDF points map 1:1 to pixels.
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert('RGB').getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """Example: print information about every non-white cell on the page."""
    page_image = page.to_image().original
    tables = page.find_tables()

    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:  # gaps in the grid have no bounding box
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


with pdfplumber.open("/path/to/target.pdf") as pdf:
    demo(pdf.pages[0])
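One detail worth spelling out: the crop above works only because the cell boxes are in PDF points and the image is rendered at the default 72 DPI. If the image were rendered with a higher `resolution` passed to `page.to_image()`, the boxes would need scaling first. The helper below is a sketch of that conversion; the name `to_pixel_box` and the `inset` parameter (used to step inside the cell so border lines are not sampled) are my own additions, and it assumes the page's top-left corner is at (0, 0).

```python
def to_pixel_box(cell, resolution=72, inset=2.0):
    """Convert a cell bbox in PDF points to an integer pixel box.

    `cell` is (x0, top, x1, bottom) in points (1 point = 1/72 inch).
    `inset` shrinks the box on each side, in points, to avoid border lines;
    it is skipped on an axis where the cell is too small to shrink.
    """
    scale = resolution / 72.0
    x0, top, x1, bottom = cell
    dx = inset if (x1 - x0) > 2 * inset else 0.0
    dy = inset if (bottom - top) > 2 * inset else 0.0
    return (
        round((x0 + dx) * scale),
        round((top + dy) * scale),
        round((x1 - dx) * scale),
        round((bottom - dy) * scale),
    )


# A 100 x 20 point cell rendered at 144 DPI, sampled 2 points inside its border.
pixel_box = to_pixel_box((10, 20, 110, 40), resolution=144, inset=2.0)
```

The resulting tuple can be passed to `image.crop(...)` in place of the raw cell box.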
踏月而来 2025-02-06 02:54:53

I found a solution using pdfplumber.
Here is some rough sample code.

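from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


def cmyk_to_rgb(cmyk: tuple[float, float, float, float]) -> tuple[float, float, float]:
    # Naive conversion without a colour profile: channel = 255 * (1 - ink) * (1 - black).
    c, m, y, k = cmyk
    return 255 * (1.0 - c) * (1.0 - k), 255 * (1.0 - m) * (1.0 - k), 255 * (1.0 - y) * (1.0 - k)


def color_to_rgb(color) -> tuple[float, float, float]:
    # Assumption: the colour has 1 (gray), 3 (RGB) or 4 (CMYK) components in the range 0..1.
    if color is None:
        return (255, 255, 255)
    components = tuple(color) if isinstance(color, (list, tuple)) else (color,)
    if len(components) == 4:
        return cmyk_to_rgb(components)
    if len(components) == 3:
        return tuple(255 * v for v in components)
    if len(components) == 1:
        return (255 * components[0],) * 3
    raise ValueError(f"unsupported colour: {color!r}")


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    # Returns the first rectangle that fully contains the cell, or None.
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    # Cells without a matching rectangle are treated as white.
    return color_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    tables: list[Table] = page.find_tables()

    # Get the RGB color of the first (top-left) cell of the first table.
    print(get_cell_color(tables[0].rows[0].cells[0], page))  # => (r, g, b)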
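One possible refinement, sketched under my own assumptions: `find_rect_for_cell` returns the first rectangle that contains the cell, which could be a page-sized background rectangle instead of the cell's own shading. Picking the smallest containing rectangle, with a small tolerance for rounding, avoids that. The names `smallest_rect_for_cell`, `contains` and `rect_area` and the tolerance value are hypothetical; the rect dicts use the same `x0`, `top`, `x1`, `bottom` keys as the code above.

```python
from typing import Optional


def rect_area(rect: dict) -> float:
    return (rect["x1"] - rect["x0"]) * (rect["bottom"] - rect["top"])


def contains(rect: dict, cell: tuple, tol: float = 1.0) -> bool:
    """True if `cell` (x0, top, x1, bottom) lies inside `rect`, within `tol` points."""
    left, top, right, bottom = cell
    return (
        rect["x0"] - tol <= left
        and rect["top"] - tol <= top
        and right <= rect["x1"] + tol
        and bottom <= rect["bottom"] + tol
    )


def smallest_rect_for_cell(cell: tuple, rects: list, tol: float = 1.0) -> Optional[dict]:
    """Return the smallest rectangle containing the cell, or None if there is none."""
    candidates = [r for r in rects if contains(r, cell, tol)]
    return min(candidates, key=rect_area, default=None)


# Hand-made rect dicts standing in for page.rects (illustrative data only).
page_background = {"x0": 0, "top": 0, "x1": 600, "bottom": 800,
                   "non_stroking_color": (1, 1, 1)}
shaded_cell = {"x0": 50, "top": 100, "x1": 150, "bottom": 120,
               "non_stroking_color": (1, 0, 0)}
rects = [page_background, shaded_cell]

match = smallest_rect_for_cell((50.2, 100.3, 149.8, 119.9), rects)
```

Here `match` is the shaded rectangle even though the background rectangle comes first in the list and also contains the cell.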