How to trim the bottom space of a PDF document in memory
I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it with the correct height right away (which I have failed to do so far) or render it with the wrong height and then trim it. I'm using Python.
Attempt type 1:
- Use wkhtmltopdf with --page-height to render a very, very long single-page PDF with a lot of extra space
- Use pdfCropMargins to trim it: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])
The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. The tool seems to expect an input file and an output file, and it also creates temporary files midway through. So I can't use it.
Attempt type 2:
- Use wkhtmltopdf with default parameters to render a multi-page PDF
- Use PyPDF4 (or PyPDF2) to read the file and combine the pages into one long, single page
The PDF is rendered fine-ish in most cases; however, a lot of extra white space sometimes appears at the bottom when the last PDF page happens to have very little content.
Ideal scenario:
The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy to render the PDF using wkhtmltopdf, since it returns bytes, and then process those bytes to remove any extra white space. But I don't want to involve the file system; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height beforehand?
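One way to picture the "inspect the PDF and remove the white space" idea: a page's visible area is described by its MediaBox, a rectangle (left, bottom, right, top) in points, and in the default PDF coordinate system y grows upward. Trimming the bottom then means raising the bottom edge. The sketch below only does that arithmetic. The name `trim_bottom`, its parameters, and the 28-point default margin are assumptions made for illustration, and it presumes the y-coordinate of the lowest content is already known. Writing the new box back into the PDF bytes would still need a PDF library working on a `BytesIO` object.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def trim_bottom(media_box: Box, content_bottom_y: float, margin: float = 28.0) -> Box:
    """Return a new box whose bottom edge sits `margin` points below the content.

    Hypothetical helper: `content_bottom_y` is the y-coordinate of the lowest
    piece of content in the page's own coordinate system (y grows upward).
    """
    left, bottom, right, top = media_box
    new_bottom = content_bottom_y - margin
    # Never extend the page below its original bottom edge or above its top edge.
    new_bottom = max(bottom, min(new_bottom, top))
    return (left, new_bottom, right, top)


# A page ten A4 pages tall (A4 is roughly 595 x 842 points), content ending at y = 5000
tall_page = (0.0, 0.0, 595.0, 8420.0)
print(trim_bottom(tall_page, content_bottom_y=5000.0))  # -> (0.0, 4972.0, 595.0, 8420.0)
```

Keeping the top edge fixed and moving only the bottom edge leaves the existing content where it is, so nothing has to be re-rendered.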
What I am doing now:
Note that pdfkit is a wkhtmltopdf wrapper.
    from io import BytesIO

    import pdfkit
    from django.template.loader import get_template
    from PyPDF4 import PdfFileReader  # PyPDF2 (before 3.0) exposes the same class


    def render_single_page_pdf() -> bytes:  # function name is illustrative
        # The template itself is not valid HTML (it includes Django-specific syntax)
        template = get_template("some-django-template.html")
        # After rendering, this is valid HTML
        rendered = template.render({
            "foo": "bar",
        })
        # First pass: render the HTML normally (multiple A4 pages) and count the pages
        page_count = PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()
        # Second pass: render a single page tall enough to hold all of those pages
        return pdfkit.from_string(rendered, options={
            "page-height": f"{297 * page_count}mm",
            "page-width": "210mm",
        })
It's equivalent to Attempt type 2, except that I don't use PyPDF4 here to stitch the pages together; instead I render again with wkhtmltopdf using a precomputed page height.
Comments (1)
There might be better ways to do this, but this at least works.
I'm assuming that you are able to crop the PDF yourself, and all I'm doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Or otherwise, just crop the image (easy in Pillow) and then convert that to PDF?
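The "how far down does the content go" step can be pictured on a rasterised page: scan the rows from the bottom and stop at the first one that contains a non-white pixel. This is a minimal sketch, not the tested code from this answer. It assumes the page has already been rendered to a grayscale raster (a list of rows of 0-255 values), and `last_content_row` and `white_threshold` are illustrative names, not part of any library.

```python
from typing import Optional, Sequence


def last_content_row(rows: Sequence[Sequence[int]], white_threshold: int = 250) -> Optional[int]:
    """Return the index of the lowest row containing a non-white pixel, or None."""
    for index in range(len(rows) - 1, -1, -1):
        if any(pixel < white_threshold for pixel in rows[index]):
            return index
    return None


# Tiny 5-row "page": content on rows 1 and 2, white below
page = [
    [255, 255, 255],
    [255, 0, 255],
    [30, 255, 255],
    [255, 255, 255],
    [255, 255, 255],
]
row = last_content_row(page)
fraction_used = (row + 1) / len(page)  # share of the page height that holds content
print(row, fraction_used)
```

The fraction, rather than the raw row index, is the useful result, because it does not depend on the resolution the page was rasterised at.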
Also, if you have one big PDF, you might need to figure out how far down the whole PDF the text ends. I'm just finding out how far down the last page the content ends. But converting from one to the other is just a simple arithmetic problem.
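That arithmetic might look like the sketch below. It assumes every page has the same height (297 mm for A4, as in the question) and that the content extent on the last page is known as a fraction of that page. `total_content_height_mm` and the 10 mm `bottom_margin_mm` default are hypothetical, chosen only for this example.

```python
def total_content_height_mm(page_count: int, last_page_fraction: float,
                            page_height_mm: float = 297.0,
                            bottom_margin_mm: float = 10.0) -> float:
    """Height of a single page holding all content plus a bottom margin."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if not 0.0 <= last_page_fraction <= 1.0:
        raise ValueError("last_page_fraction must be between 0 and 1")
    full_pages = (page_count - 1) * page_height_mm
    last_page = last_page_fraction * page_height_mm
    # Cap at the untrimmed height so the margin never makes the result
    # taller than page_count full pages.
    return min(full_pages + last_page + bottom_margin_mm, page_count * page_height_mm)


# 3 pages, content reaching 20% of the way down the last page
height = total_content_height_mm(3, 0.2)
print(f"{height:.1f}mm")  # -> 663.4mm
```

A string formatted this way could be passed as the "page-height" option shown in the question. Per-page margins and page breaks in the multi-page render mean the figure is an estimate rather than an exact content height.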
Tested code: