将无数 ODT 文件转换为 PDF 的可靠且快速的方法？

发布于 2024-09-03 01:09:51 字数 1549 浏览 5 评论 0原文

我需要从带有嵌入字体的简单模板（几页和表格）预先生成一百万或两个 PDF 文件。通常，在这种情况下，我会保持较低的水平，并使用像 ReportLab 这样的库来编写所有内容，但我加入该项目的时间较晚。

目前，我有一个 template.odt 并使用 content.xml 文件中的标记来填充数据库中的数据。我可以顺利地创建 ODT 文件，它们看起来总是正确的。

对于 ODT 到 PDF 的转换，我在服务器模式下使用 openoffice （以及 PyODConverter w/命名管道），但它不是很可靠：在一批文档中，最终有一个点，之后所有处理的文件都会转换为垃圾（错误的字体和字母遍布整个页面）。

问题不可预测地重现（不依赖于数据），发生在 OOo 2.3 和 3.2 中，在 Ubuntu、XP、Server 2003 和 Windows 7 中。我的 Heisenbug 检测器正在滴答作响。

我尝试减少批次的大小并在每个批次后重新启动 OOo；仍然有一小部分文件都搞砸了。

当然，我会在 Ooo 邮件列表上写下这一点，但与此同时，我已经交付了，已经浪费了太多时间。

我该去哪里？

完全避免 ODT 格式并采用其他模板系统。
- 建议？任何需要几秒钟运行的事情都太慢了。 OOo 大约需要一秒钟，总共需要 15 天的处理时间。我必须编写一个程序来将作业聚集到多个客户端。
保留格式，但使用其他工具/程序进行转换。
- 哪一个？ Windows 的共享软件或商业存储库中有许多应用程序，但尝试每一个应用程序都是一项艰巨的任务。有些太慢，有些如果不先购买就无法批量运行，有些无法从命令行运行，等等。
- 开源工具往往不会重新发明轮子，并且通常依赖于 openoffice。
转换为中间 .DOC 格式有助于避免 OOo 错误，但它会使处理时间加倍并使本来就很复杂的任务变得复杂。
尝试生成两次 PDF 并进行比较，如果有问题则丢弃整个批次。
- 虽然文档看起来相同，但我知道无法比较二进制内容。
处理每个文档后重新启动 OOo。
- 生产它们需要更多时间
- 这会降低错误文件的百分比，并使识别它们变得非常困难。
转到 ReportLab 并以编程方式重新创建页面。这是我将在几分钟后尝试的方法。
了解如何正确设置项目符号列表的格式

非常感谢。

编辑：看来我根本无法使用ReportLab，它不允许我嵌入字体。我的字体有 TrueType 和 OpenType 版本。

TrueType 说“TTFError：字体不允许子集/嵌入（0100）”。

OpenType 版本显示“不支持 TTFError[...] postscript 轮廓”。

非常非常有趣。

原文

I need to pre-produce a million or two PDF files from a simple template (a few pages and tables) with embedded fonts. Usually, I would stay low level in a case like this, and compose everything with a library like ReportLab, but I joined late in the project.

Currently, I have a template.odt and use markers in the content.xml files to fill with data from a DB. I can smoothly create the ODT files, they always look rigth.

For the ODT to PDF conversion, I'm using openoffice in server mode (and PyODConverter w/ named pipe), but it's not very reliable: in a batch of documents, there is eventually a point after which all the processed files are converted into garbage (wrong fonts and letters sprawled all over the page).

Problem is not predictably reproducible (does not depend on the data), happens
in OOo 2.3 and 3.2, in Ubuntu, XP, Server 2003 and Windows 7. My Heisenbug detector is ticking.

I tried to reduce the size of batches and restarting OOo after each one; still, a small percentage of the documents
are messed up.

Of course I'll write about this on the Ooo mailing lists, but in the meanwhile, I have a delivery and lost too much time already.

Where do I go?

Completely avoid the ODT format and go for another template system.
- Suggestions? Anything that takes a few seconds to run is way too slow. OOo takes around a second and it sums to 15 days of processing time. I had to write a program for clustering the jobs over several clients.
Keep the format but go for another tool/program for the conversion.
- Which one? There are many apps in the shareware or commercial repositories for windows, but trying each one is a daunting task.
  Some are too slow, some cannot be run in batch without buying it first, some cannot work from command line, etc.
- Open source tools tend not to reinvent the wheel and often depend on openoffice.
Converting to an intermediate .DOC format could help to avoid the OOo bug, but it would double the processing time and complicate a task that is already too hairy.
Try to produce the PDFs twice and compare them, discarding the whole batch if there's something wrong.
- Although the documents look equal, I know of no way to compare the binary content.
Restart OOo after processing each document.
- it would take a lot more time to produce them
- it would lower the percentage of the wrong files, and make it very hard to identify them.
Go for ReportLab and recreate the pages programmatically. This is the approach I'm going to try in a few minutes.
Learn to properly format bulleted lists