pdf-generation scalable

What are the ways to generate a PDF?

Posted 2024-10-09 17:18:40

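I have an idea for an app that takes some Flash content containing graphics and images (for example, various geometric shapes and polygons), plus some arbitrary images, and converts it to PDF.

Since I envision this app being used by multiple users, I also want the process to be fast and scalable. One possible solution I can think of is a small Flash client that can assemble the graphics and images described above and generate some kind of XML, which is sent to a server running a Java process that renders the PDF using iText.

I would like to know what other approaches or best practices exist. The technology is not an issue; it can be open source or commercial.

I know that image uploads and the like will take varying amounts of time, so please assume the images are readily available. These are the criteria I am looking for in a PDF rendering solution:

No limitations on the Flash client because of the PDF rendering engine.
Scalable to multiple users.
Speed and efficiency.
Minimal serialization/deserialization.

I would appreciate it if you could share your ideas on the technology stack. Thanks a lot!

PS: I would appreciate it if you did not get bogged down in my Flash >> XML >> Java approach. I am sure it is only one of the many ways this can be done.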

莫言歌 2024-10-16 17:18:40

If generating the PDF in the browser with Flash is an option, you could look at AlivePDF. If not, check out XSL-FO; we use it to convert to PDF on the server side.

再可℃爱ぅ一点好了 2024-10-16 17:18:40

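I believe iText generates PDF from Java code. It may or may not use XML as its data source; POJOs would work as well.

Another way to go is XSL-FO. It requires an XML data source and an XSLT stylesheet that transforms the XML into XSL-FO. Apache's Xalan (or any other XSLT library) can do that transformation for you; an FO processor such as Apache FOP then renders the XSL-FO to PDF.

"Fast" and "scalable" might require more than that. Uploading lots of images is a process with its own time scale and optimizations, independent of the PDF work.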

泛泛之交 2024-10-16 17:18:40

There's pdflib for PHP, and FPDF (also for PHP).

慢慢从新开始 2024-10-16 17:18:40

So you're also willing to consider other clients? It sounds like you've got a kids' drawing app and want to generate something that will preserve the state of their drawing at the time.

Let's face it, XML isn't that efficient. That's not its purpose. It's meant to be both machine and human readable, validatable, and so on.

Instead, how about a <canvas>-based web page that submits the state of that canvas to the server as JSON (fewer bytes, and less work to build them)? The server can then work in whatever library/language it wants. There are lots of JSON->my-language libraries floating around out there.

Your choice of PDF library is then limited only by what you have installed on your server. You also said you wanted to do as little reading/writing as possible.

The most efficient possible setup would be to have a read-only partial PDF already loaded into memory tailored to minimize the impact of canvas changes (including images). Each session would dupe that partial PDF, convert the JSON to PDF graphic commands, and save the PDF.
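
As a rough illustration of the "convert the JSON to PDF graphic commands" step, here is a minimal Python sketch. The JSON layout (`shapes`, `type`, `fill`, `points`) and the function name are made up for this example, and the coordinates are assumed to already be in PDF user space (origin at the bottom left). Only the operators themselves (`rg`, `re`, `m`, `l`, `h`, `f`) come from the PDF content syntax.

```python
import json


def num(value) -> str:
    """Format a number without exponent notation, which PDF does not accept."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def canvas_json_to_content_stream(payload: str) -> bytes:
    """Translate a hypothetical JSON canvas state into PDF content-stream operators."""
    state = json.loads(payload)
    ops = []
    for shape in state["shapes"]:
        # Fill colour as RGB components in the range 0..1 (defaults to black).
        r, g, b = shape.get("fill", [0, 0, 0])
        ops.append(f"{num(r)} {num(g)} {num(b)} rg")
        if shape["type"] == "rect":
            ops.append(" ".join(num(shape[k]) for k in ("x", "y", "w", "h")) + " re")
        elif shape["type"] == "polygon":
            (x0, y0), *rest = shape["points"]
            ops.append(f"{num(x0)} {num(y0)} m")                  # start the path
            ops.extend(f"{num(x)} {num(y)} l" for x, y in rest)   # line segments
            ops.append("h")                                       # close the path
        else:
            raise ValueError(f"unsupported shape type: {shape['type']}")
        ops.append("f")                                           # fill the path
    return ("\n".join(ops) + "\n").encode("ascii")
```

The returned bytes are what would go between the two templates as the page's content stream.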

To minimize structural changes to the PDF you'd want to use Inline Images. No new objects in the PDF means you don't need to change your cross reference table at all (until you add fonts or want to reuse an existing image). You could build the "doc info" dictionary padded with a specific amount of spaces between objects so you could fill it in without changing any byte offsets (which would force you to recompute the xref table).
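
To show what an inline image looks like inside the content stream, here is a small Python sketch. The function name and parameters are hypothetical, and it assumes raw 8-bit RGB pixels encoded with the ASCIIHex filter, which keeps the stream plain text but doubles the image size; a real app would probably pick a more compact filter.

```python
def inline_image(rgb: bytes, width: int, height: int,
                 x: int, y: int, w: int, h: int) -> bytes:
    """Return content-stream bytes that paint raw RGB pixels as an inline image.

    (x, y) is the lower-left corner on the page and (w, h) the painted size,
    all in PDF user-space units. No new PDF object is created.
    """
    if len(rgb) != width * height * 3:
        raise ValueError("expected 3 bytes per pixel")
    # '>' is the end-of-data marker for ASCIIHexDecode.
    data = rgb.hex().upper().encode("ascii") + b">"
    return b"\n".join([
        b"q",                                          # save graphics state
        f"{w} 0 0 {h} {x} {y} cm".encode("ascii"),     # scale and position the unit square
        b"BI",                                         # begin inline image
        f"/W {width} /H {height} /CS /RGB /BPC 8 /F /AHx".encode("ascii"),
        b"ID",                                         # image data follows
        data,
        b"EI",                                         # end inline image
        b"Q",                                          # restore graphics state
    ]) + b"\n"
```

Because the image lives entirely inside the content stream, the object count and the cross reference table stay exactly as they were in the template.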

You may or may not need to mess with the page size... we are just talking about one page here, right?

So the PDF would look something like...

%PDF-1.6
%<3-4 random high-order bytes to convince folks that we're a binary stream>
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/Contents 4 0 R/MediaBox[0 0 612 792]/Resources<<>>/Parent 2 0 R>>
endobj
5 0 obj
<</Type/DocInfo/Author()  --<insert big whitespace gap here>--
/Title() --<ditto>--
/Subject() --<ditto>--
/Keywords() --<ditto>--
/Creator(My app's Name)
/Producer(My pdf library's name)
/CreationDate(D:YYYYMMDDHHMMSS-timeZoneOffset) % encoded date when this template was built
/ModDate() --<another, smaller whitespace gap>--
>>
endobj
4 0 obj
<</Filter/SeveralDifferentFiltersAvailable/Length --<byte length of the stream in this file>-->>
stream

And your template stops there. You'd have a similar "end of the PDF" template that would look something like this:

endstream
endobj
xref
0 6
0000000000 65535 f 
0000000010 00000 n
0000000025 00000 n
0000000039 00000 n
0000000097 00000 n
0000000050 00000 n
trailer
<</Root 1 0 R/Size 6/Info 5 0 R>>
startxref
--<some white space>--
%%EOF

The numbers in the xref table above are placeholders, not real values. The first column is the byte offset of that particular object (and I'm not up for counting bytes just now, thank you). The second column is the generation number, which is largely irrelevant here.
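
For anyone who would rather not count bytes by hand either, a short Python sketch can derive the first column from the template itself. The function name is made up for this example, and it assumes every object starts at the beginning of a line, has generation 0, and that objects are numbered 1..N with none missing.

```python
import re


def xref_section(head: bytes, object_count: int) -> bytes:
    """Build the xref section for objects 1..object_count found in `head`."""
    offsets = {}
    # "N 0 obj" at the start of a line marks where object N begins.
    for match in re.finditer(rb"(?m)^(\d+) 0 obj\b", head):
        offsets[int(match.group(1))] = match.start()
    lines = [
        b"xref\n",
        f"0 {object_count + 1}\n".encode("ascii"),
        b"0000000000 65535 f \n",     # entry 0 is always the head of the free list
    ]
    for number in range(1, object_count + 1):
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        # 'n' for in-use, then a space and a line feed.
        lines.append(f"{offsets[number]:010d} 00000 n \n".encode("ascii"))
    return b"".join(lines)
```

Entries are emitted in object-number order, so it does not matter that object 5 appears before object 4 in the template.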

The PDF-filling app will need to know:

1. The byte offsets of everything you intend to fill in within the first template:
   1. All the "doc info" fields, which are all optional by the way. The /Info key and the dictionary it points to are optional for that matter. You could yank 'em if you cared to.
   2. The /Length key of the content stream. That needs to be the post-filter byte length of the stream itself.
2. How to convert the JSON into PDF drawing commands. If you wanted to cheat a bit you could use iText[Sharp]'s PdfContentByte class, use its drawing commands, and then get the finished byte stream and slap that into your PDF. Be sure you use Inline Images or this whole scheme goes right out the window. There are probably other libraries you could gut similarly if you felt the need. Or you could just read up on the PDF spec and roll your own. You'll be sticking to a fairly limited subset of PDF's content syntax.
3. The byte offset of the word "xref" from the start of the file. You can calculate this: LengthOfInitialTemplate + LengthOfContentStream + OffsetFromStartOf2ndTemplateTo'xref'.
4. The byte offset of the line below "startxref", which is where you write the aforecalculated byte offset of 'xref'.
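
Putting those pieces together, a minimal Python sketch of the fill step might look like the following. It is simplified on purpose: the templates drop the /Info dictionary and the stream filter, and the `@` and `#` placeholder runs, like every name here, are assumptions made for this example rather than anything PDF defines.

```python
import re

LEN_SLOT = b"@" * 10    # padded slot for the stream /Length
XREF_SLOT = b"#" * 10   # padded slot for the byte offset of "xref"

HEAD = (
    b"%PDF-1.6\n"
    b"1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n"
    b"2 0 obj\n<</Type/Pages/Count 1/Kids[3 0 R]>>\nendobj\n"
    b"3 0 obj\n<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]"
    b"/Resources<<>>/Contents 4 0 R>>\nendobj\n"
    b"4 0 obj\n<</Length " + LEN_SLOT + b">>\nstream\n"
)


def build_tail(head: bytes) -> bytes:
    """Build the end-of-PDF template. Every object lives in `head`, in numeric
    order, so the xref entries are computed once and never change."""
    offsets = [m.start() for m in re.finditer(rb"(?m)^\d+ 0 obj", head)]
    size = len(offsets) + 1
    xref = f"xref\n0 {size}\n0000000000 65535 f \n"
    xref += "".join(f"{off:010d} 00000 n \n" for off in offsets)
    trailer = f"trailer\n<</Root 1 0 R/Size {size}>>\nstartxref\n"
    return (b"\nendstream\nendobj\n" + xref.encode("ascii")
            + trailer.encode("ascii") + XREF_SLOT + b"\n%%EOF\n")


# Computed once, when the templates are loaded.
TAIL = build_tail(HEAD)
LEN_AT = HEAD.index(LEN_SLOT)
XREF_IN_TAIL = TAIL.index(b"xref\n")
SLOT_IN_TAIL = TAIL.index(XREF_SLOT)


def padded(value: int) -> bytes:
    """Left-justify a number in a 10-byte field so no byte offsets move."""
    return str(value).ljust(10).encode("ascii")


def fill(content: bytes) -> bytes:
    """Copy the templates around one content stream and patch the two slots."""
    pdf = bytearray(HEAD + content + TAIL)
    pdf[LEN_AT:LEN_AT + 10] = padded(len(content))
    tail_start = len(HEAD) + len(content)
    slot = tail_start + SLOT_IN_TAIL
    pdf[slot:slot + 10] = padded(tail_start + XREF_IN_TAIL)
    return bytes(pdf)
```

Per request, the only work is one buffer copy and two fixed-width overwrites; everything else was calculated when the templates were first loaded.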

You're not going to get much more efficient than that. You'd read in your templates once. Read/calculate the byte offsets you needed once.
