A reliable and fast way to convert countless ODT files to PDF?

Posted on 2024-09-03 01:09:51


I need to pre-produce a million or two PDF files from a simple template (a few pages and tables) with embedded fonts. Usually I would stay low-level in a case like this and compose everything with a library like ReportLab, but I joined the project late.

Currently, I have a template.odt and use markers in the content.xml file to fill in data from a DB. I can create the ODT files smoothly, and they always look right.
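For illustration, a minimal sketch of that marker-filling step, assuming the markers are plain placeholder strings such as "${NAME}" inside content.xml (the marker names and paths below are made up):

```python
import zipfile

def fill_template(template_path, output_path, values):
    """Write a copy of template.odt with placeholder markers replaced in content.xml.

    `values` maps hypothetical marker strings like "${NAME}" to the data
    pulled from the DB for one document.
    """
    # An ODT file is a ZIP archive; content.xml holds the document body.
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                for marker, value in values.items():
                    text = text.replace(marker, value)
                data = text.encode("utf-8")
            # Reusing the original ZipInfo keeps entry order and compression
            # (e.g. the uncompressed "mimetype" entry) intact.
            dst.writestr(item, data)

# Example with made-up markers:
# fill_template("template.odt", "out/000001.odt",
#               {"${NAME}": "Alice", "${TOTAL}": "42.00"})
```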

For the ODT to PDF conversion, I'm using OpenOffice in server mode (with PyODConverter over a named pipe), but it's not very reliable: in a batch of documents, there is eventually a point after which all the processed files are converted into garbage (wrong fonts and letters sprawled all over the page).
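For context, a stripped-down sketch of how that conversion step is typically driven; the listener options and the PyODConverter command-line form below are assumptions based on its usual documentation, so adjust them to whatever your setup actually uses:

```python
import subprocess

# An OOo instance is assumed to be already running in headless server mode,
# started along the lines of (the exact accept string may differ per version):
#   soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

def convert_to_pdf(odt_path, pdf_path):
    """Convert one ODT to PDF through the running OOo listener.

    Assumes PyODConverter's DocumentConverter.py script can be invoked with
    input/output file arguments; swap in the driver you actually use.
    """
    subprocess.check_call(["python", "DocumentConverter.py", odt_path, pdf_path])

# for odt in odt_files:
#     convert_to_pdf(odt, odt[:-4] + ".pdf")
```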

The problem is not predictably reproducible (it does not depend on the data) and happens in OOo 2.3 and 3.2, on Ubuntu, XP, Server 2003 and Windows 7. My Heisenbug detector is ticking.

I tried reducing the batch size and restarting OOo after each batch; still, a small percentage of the documents are messed up.

Of course I'll write about this on the OOo mailing lists, but in the meantime I have a delivery to make and have already lost too much time.

Where do I go?

  1. Completely avoid the ODT format and go for another template system.

    • Suggestions? Anything that takes a few seconds to run is way too slow. OOo takes around a second per document, which adds up to about 15 days of processing time (see the rough arithmetic after this list). I had to write a program to distribute the jobs across several clients.
  2. Keep the format but go for another tool/program for the conversion.

    • Which one? There are many apps in the shareware or commercial repositories for Windows, but trying each one is a daunting task. Some are too slow, some cannot be run in batch without buying them first, some cannot be run from the command line, etc.
    • Open source tools tend not to reinvent the wheel and often depend on OpenOffice.
  3. Converting to an intermediate .DOC format could help to avoid the OOo bug, but it would double the processing time and complicate a task that is already too hairy.

  4. Try to produce the PDFs twice and compare them, discarding the whole batch if there's something wrong.

    • Although the documents look identical, I know of no way to compare the binary content.
  5. Restart OOo after processing each document.

    • It would take a lot more time to produce them.
    • It would lower the percentage of wrong files, but make it very hard to identify them.
  6. Go for ReportLab and recreate the pages programmatically. This is the approach I'm going to try in a few minutes.

  7. Learn to properly format bulleted lists
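For what it's worth, a rough check of the "15 days" figure from option 1, assuming a single client and a document count in the middle of the stated "million or two" range (both numbers are assumptions):

```python
# Back-of-the-envelope throughput estimate for option 1.
docs = 1300000          # assumed document count, within "a million or two"
seconds_per_doc = 1.0   # roughly what OOo takes per conversion
days = docs * seconds_per_doc / 86400.0
print(round(days, 1))   # ~15.0 days of serial processing on one client
```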

Thanks a lot.

Edit: it seems I cannot use ReportLab at all; it won't let me embed the font.
My font comes in TrueType and OpenType versions.

The TrueType one says "TTFError: Font does not allow subsetting/embedding (0100)".

The OpenType version says "TTFError[...] postscript outlines are not supported".

Very very funny.
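For reference, the failure above surfaces when the font is registered with ReportLab; a minimal sketch of that embedding attempt, with a made-up font name and file path:

```python
from reportlab.lib.pagesizes import A4
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont, TTFError
from reportlab.pdfgen import canvas

try:
    # Constructing/registering the TTFont is where ReportLab reads the font's
    # embedding flags; a font marked "no subsetting/embedding" raises TTFError here.
    pdfmetrics.registerFont(TTFont("TemplateFont", "TemplateFont.ttf"))  # hypothetical file
except TTFError as err:
    print("Cannot embed this font:", err)
else:
    c = canvas.Canvas("sample.pdf", pagesize=A4)
    c.setFont("TemplateFont", 12)
    c.drawString(72, 720, "Embedded-font test page")
    c.save()
```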


Comments (5)

日记撕了你也走了 2024-09-10 01:09:51


For creating such a large number of PDF files, OpenOffice seems to me the wrong product. You should use a real reporting solution that is optimized for creating large numbers of PDF files. There are many different tools; I would recommend i-net Clear Reports (formerly called i-net Crystal-Clear).

  • I would expect a single PDF file to be created faster than with OpenOffice.
  • Creating 2 PDF files and comparing them would cost a lot of speed.
  • It can embed TrueType fonts.
  • With the API you can work in a loop.
  • With a trial license you can work on your batch for 90 days.

The disadvantage is that you must restart your development.

情魔剑神 2024-09-10 01:09:51


I would probably end up finding some way to determine when the batch processing goes haywire, then reprocess everything from shortly before it failed. How to determine when it goes haywire? That will require analyzing some correct PDFs and some failed ones, to look for similarities among them:

  • generated files aren't the right size compared to their source
  • the files don't contain some string (like the name of your font)
  • some bit of data is not in the expected place
  • when converted back to text, they don't contain expected data from the template
  • when converted to a bitmap, text isn't in the right place

I suspect that converting them back to text and looking for expected strings is going to be the most accurate solution, but also slow. If it's too slow to run on every file, run it on every 1/100th or so, and just reconvert every file after the last known good one.
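A rough sketch of that text-based spot check, assuming poppler's pdftotext is installed and that you know a few strings every correct document must contain (the strings and paths below are made up):

```python
import subprocess

EXPECTED = ["Invoice", "Total amount"]  # made-up strings from the template

def looks_sane(pdf_path):
    """Extract the text layer and verify the expected template strings survived."""
    text = subprocess.check_output(["pdftotext", pdf_path, "-"]).decode("utf-8", "replace")
    return all(marker in text for marker in EXPECTED)

# Spot-check e.g. every 100th file; on the first failure, reconvert everything
# produced after the last known-good document.
# suspects = [p for p in pdf_files[::100] if not looks_sane(p)]
```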

怪异←思 2024-09-10 01:09:51


For your scenario it seems that ReportLab PLUS is a good fit, including templates and phone support to get you going fast.

婴鹅 2024-09-10 01:09:51


Very interesting problem. Since you have already written it to cluster across several machines, why not use the double-production approach and spread it across EC2 nodes? It will cost a bit extra, but you can compare the results using MD5 or SHA hashes, and if the two versions are the same you can move on.
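A minimal sketch of that hash comparison, under the assumption that two good runs of the same document come out byte-identical (paths are placeholders):

```python
import hashlib

def file_digest(path):
    """MD5 of a file, read in chunks so large batches don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def same_output(run_a_pdf, run_b_pdf):
    """True if the two independently produced PDFs are byte-for-byte identical."""
    return file_digest(run_a_pdf) == file_digest(run_b_pdf)

# if same_output("runA/000001.pdf", "runB/000001.pdf"):
#     pass  # keep the document and move on
```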

自找没趣 2024-09-10 01:09:51


For comparing 2 PDF files I would recommend the i-net PDF content comparer. It can compare 2 directories of PDF files very well. We use it in our regression test system.
