How to trim the bottom white space of a PDF document in memory

Posted on 2025-01-17 13:17:18

I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I've failed to do so far) or render it incorrectly and trim it. I'm using Python.

Attempt type 1:

  • wkhtmltopdf render to a very, very long single-page PDF with a lot of extra space using --page-height
  • Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. It seems that the tool expects an input file and output file, and also creates temporary files midway through. So I can't use it.

Attempt type 2:

  • wkhtmltopdf render to multi-page PDF with default parameters
  • Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page

The PDF renders fine-ish in most cases. However, sometimes a lot of extra white space can be seen at the bottom if the last PDF page happened to have very little content.

Ideal scenario:

The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy with rendering the PDF using wkhtmltopdf, since it returns bytes, and later processing these bytes to remove any extra white space. But I don't want to involve the file system in this; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height beforehand?

What I am doing now:

Note that pdfkit is a wkhtmltopdf wrapper.

# This is not valid HTML yet (it still contains Django template syntax)
template: Template = get_template("some-django-template.html")

# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})

# First render the PDF from HTML normally (multiple A4 pages) and count the pages
page_count = PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()

# Then render a single-page PDF whose height is page_count A4 pages (297mm each)
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * page_count}mm",
    "page-width": "210mm"
})

It's equivalent to Attempt type 2, except that I don't use PyPDF4 here to stitch the pages together, but instead render again with wkhtmltopdf using a precomputed page height.

Comments (1)

流年里的时光 2025-01-24 13:17:18

There might be better ways to do this, but this at least works.

I'm assuming that you are able to crop the PDF yourself, and all I'm doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Or otherwise, just crop the image (easy in Pillow) and then convert that to PDF?

Also, if you have one big PDF, you might need to figure out how far down on the whole PDF the text ends. I'm just finding out how far down on the last page the content ends. But converting from one to the other is just an easy arithmetic problem.
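As a rough illustration of that arithmetic, here is a minimal sketch. The function name `content_end_mm` and its parameters are made up for this example, and it assumes every page has the same height (A4, 297 mm, as in the question):

```python
def content_end_mm(page_count: int, percentage_from_top: float,
                   page_height_mm: float = 297.0) -> float:
    """Distance from the top of the first page to the end of the content.

    page_count          -- number of pages in the multi-page layout
    percentage_from_top -- where content ends on the last page (0-100)
    page_height_mm      -- height of one page; 297 mm is A4 (an assumption)
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Every page before the last one is assumed to be completely used.
    full_pages_mm = (page_count - 1) * page_height_mm
    last_page_mm = page_height_mm * percentage_from_top / 100
    return full_pages_mm + last_page_mm


# Three A4 pages, content ending halfway down the last one.
print(content_end_mm(3, 50.0))  # 742.5
```

Going the other way, from a position on one long page back to a position on the last page, is the same calculation in reverse.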

Tested code:

import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO

# This library isn't named fitz on pypi,
# obtain this library with `pip install PyMuPDF==1.19.4`
import fitz

# `pip install Pillow==8.3.1`
from PIL import Image

import numpy as np

# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

# convert the pdf into an image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
last_page = pdf[pdf.pageCount-1]
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")

image = Image.frombytes("L", [image_pixels.width, image_pixels.height], image_pixels.samples)

#Uncomment if you want to see.
#image.show()

# Now figure out where the end of the text is:

# First binarize. This might not be the most efficient way to do this.
# But it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details but
# basically, we treat every pixel > 100 as a white pixel, 
# We convert the result to a true/false matrix 
# And then invert that. 
# The upshot is that, at the end, a value of "True" 
# in the matrix will represent a black pixel in that location.
binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))

# Now find the last row that contains a dark pixel, starting at the bottom
row_count, column_count = binary_matrix.shape

# last_row counts rows up from the bottom edge of the page
last_row = 0
for i, row in enumerate(reversed(binary_matrix)):
    if any(row):
        last_row = i
        break

percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)

# Now you know where the page ends.
# Go back and crop the PDF accordingly.
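
One possible way to do that last step without touching the file system is to skip cropping and render once more with a page height that ends just below the content. The sketch below only builds the options; `trimmed_page_options`, `rendered_height_mm` and `bottom_margin_mm` are names invented for this example. It assumes the layout does not change when the page gets shorter, and that the chosen bottom margin is not smaller than the margin wkhtmltopdf itself applies (otherwise content may spill onto a second page).

```python
import math


def trimmed_page_options(percentage_from_top: float,
                         rendered_height_mm: float,
                         bottom_margin_mm: float = 10.0,
                         page_width_mm: int = 210) -> dict:
    """Build pdfkit options for a page that ends just below the content.

    percentage_from_top -- value computed by the code above (0-100)
    rendered_height_mm  -- height of the single long page that was measured
    bottom_margin_mm    -- white space to keep below the content (assumed value)
    """
    content_mm = rendered_height_mm * percentage_from_top / 100
    # Round up to whole millimetres so no content is cut off, and never
    # ask for a page taller than the one that was measured.
    height_mm = min(math.ceil(content_mm + bottom_margin_mm),
                    math.ceil(rendered_height_mm))
    return {"page-height": f"{height_mm}mm", "page-width": f"{page_width_mm}mm"}


# Content ends a quarter of the way down a page that was rendered 594 mm tall.
options = trimmed_page_options(25.0, 594)
print(options)  # {'page-height': '159mm', 'page-width': '210mm'}
```

The resulting dict can be passed as `options` to another `pdfkit.from_string(rendered, options=options)` call, at the cost of a third render.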