PDFBox command-line merge tool warnings

Posted on 2025-01-30 09:31:26

I am seeing warnings when running the merge feature of the PDFBox command-line tool.

Both pdfbox-app-2.0.26.jar and pdfbox-app-3.0.0-SNAPSHOT.jar report the same potential issue. I am trying to merge, for example, 132 PDF files created by the Edge browser from emails in HTML format. The HTML emails are generated by the free MBox Viewer application.

PDFBox logs more than 65,000 warnings like the following:

May 19, 2022 12:07:23 PM org.apache.pdfbox.multipdf.PDFMergerUtility appendDocument
May 19, 2022 12:07:23 PM org.apache.pdfbox.multipdf.PDFMergerUtility mergeIDTree
WARNING: key node00000001 already exists in destination IDTree
May 19, 2022 12:07:23 PM org.apache.pdfbox.multipdf.PDFMergerUtility mergeIDTree
WARNING: key node00000002 already exists in destination IDTree
May 19, 2022 12:07:23 PM org.apache.pdfbox.multipdf.PDFMergerUtility mergeIDTree
WARNING: key node00000009 already exists in destination IDTree

The merged file seems to be OK, but it is hard to know without comparing the individual PDF files with the merged file.

I would like to understand the potential issues behind these warnings. I cloned v3 of PDFBox from GitHub, installed Maven, and built pdfbox-app-3.0.0-SNAPSHOT.jar without any problems. The warnings are generated in the file

pdfbox\pdfbox\src\main\java\org\apache\pdfbox\multipdf\PDFMergerUtility.java

by the mergeIDTree() function (see the excerpt at the bottom).

I am looking for help understanding the implications of these warnings, since PDFBox is a very large and complex project. During merging, it appears that the IDs created for each PDF document are merged into a single destination container. I am not sure, but it appears that the IDs represent COS objects, each with an associated unique key of the form node00000001. I don't know how to import the project into a visual debugger to browse and step through the code; I am new to Java.

It appears that the same key may appear in different PDF documents. I am not sure how keys are generated for each PDF document. The question is whether different objects end up with the same key. That seems to be the case; otherwise, why would PDFBox log the warnings? If key collisions are a problem, why does PDFBox not assign a unique key range to each PDF document, such as node00000001-node00000100 for the first document, node00000101-node00000200 for the second, and so on? I suspect this would result in a huge IDTree structure and likely increase the size of the generated PDF, but it could potentially address the warnings. This is just my guess, since I do not understand the merging process and probably should not speculate.
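The unique-key-range idea can be sketched with plain maps. This is only an illustration under stated assumptions, not something PDFBox is known to do: KeyRangeSketch and mergeWithRenumbering are hypothetical names, strings stand in for structure elements, and the sketch ignores anything elsewhere in a real PDF that might refer to the old keys (a real implementation would presumably have to update those references as well).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
class KeyRangeSketch {

    // Gives every entry of every source document a fresh sequential key,
    // so entries from different documents can never collide.
    static Map<String, String> mergeWithRenumbering(List<Map<String, String>> documents) {
        Map<String, String> destination = new LinkedHashMap<>();
        int next = 1;
        for (Map<String, String> source : documents) {
            for (String element : source.values()) {
                destination.put(String.format("node%08d", next++), element);
            }
        }
        return destination;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("node00000001", "first.pdf heading");
        first.put("node00000002", "first.pdf paragraph");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("node00000001", "second.pdf heading"); // same key as in first

        // All three entries survive because the keys are reassigned.
        System.out.println(mergeWithRenumbering(List.of(first, second)));
    }
}
```

In this toy version the merged map grows by exactly one entry per source entry, so the size concern raised above comes down to keeping every entry rather than dropping the colliding ones.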

private void mergeIDTree(PDFCloneUtility cloner,
        PDStructureTreeRoot srcStructTree,
        PDStructureTreeRoot destStructTree) throws IOException
{
    PDNameTreeNode<PDStructureElement> srcIDTree = srcStructTree.getIDTree();
    // ...... (elided in this post; destIDTree is presumably obtained here)
    Map<String, PDStructureElement> srcNames = getIDTreeAsMap(srcIDTree);
    Map<String, PDStructureElement> destNames = getIDTreeAsMap(destIDTree);
    for (Map.Entry<String, PDStructureElement> entry : srcNames.entrySet())
    {
        if (destNames.containsKey(entry.getKey()))
        {
            // Key already present in the destination: only a warning is logged,
            // and the source element is not added under this key.
            LOG.warn("key " + entry.getKey() + " already exists in destination IDTree");
        }
        else
        {
            // New key: clone the source structure element into the destination map.
            destNames.put(entry.getKey(),
                          new PDStructureElement((COSDictionary) cloner.cloneForNewDocument(entry.getValue().getCOSObject())));
        }
    }
    // ......
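
The collision handling in the excerpt above can be mimicked with plain java.util maps. This is a minimal sketch, assuming strings as stand-ins for PDStructureElement; IdTreeCollisionSketch and mergeSkippingDuplicates are hypothetical names, not PDFBox API, and the sketch covers only the visible loop, not the elided parts of mergeIDTree().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; it mirrors the loop in the excerpt, nothing more.
class IdTreeCollisionSketch {

    // Copies source entries into destination unless the key is already present.
    // Returns the keys that were skipped (the cases where the excerpt logs a warning).
    static List<String> mergeSkippingDuplicates(Map<String, String> destination,
                                                Map<String, String> source) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            if (destination.containsKey(entry.getKey())) {
                skipped.add(entry.getKey()); // existing destination entry is kept
            } else {
                destination.put(entry.getKey(), entry.getValue());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, String> destination = new LinkedHashMap<>();
        destination.put("node00000001", "first.pdf heading");

        Map<String, String> source = new LinkedHashMap<>();
        source.put("node00000001", "second.pdf heading");   // collides
        source.put("node00000002", "second.pdf paragraph"); // does not collide

        System.out.println("skipped: " + mergeSkippingDuplicates(destination, source));
        System.out.println("merged:  " + destination);
    }
}
```

In this toy run the second document's node00000001 entry is skipped and the destination keeps the entry from the first document, which is the situation each warning in the log appears to report.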
