PDF cross-reference streams

Posted on 2024-10-09 23:02:21

I'm developing a PDF parser/writer, but I'm stuck on generating cross-reference streams.
My program reads this file, removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it.

This works really well when I use a normal cross-reference table and trailer, as you can see in this file.

When I try to generate a cross-reference stream object instead (which results in this file), Adobe Reader can't view it.

Does anyone have experience with PDFs and can help me find what the problem is?

Note that the cross reference is the ONLY difference between file 2 and file 3. The first 34127 bytes are the same.

If someone needs the content of the decoded reference stream, download this file and open it in a hex editor. I've checked this reference table again and again but I could not find anything wrong. The dictionary seems to be OK, too.

Thanks so much for your help!!!

Update

I've now completely solved the problem. You can find the new PDF here.


Comments (2)

懷念過去 2024-10-16 23:02:21

Two problems I see (without looking at the stream data itself):

  1. "Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."

    Your /Size should be... 14.

  2. "Index array (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection.
    The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section.
    Default value: [0 Size]."

    Your /Index should probably skip around a bit. You have no objects 2-4 or 7. The /Index array needs to reflect that.

  3. Your data ain't right either (and I just learned how to read an xref stream. Yay me.)

00 00 00  
01 00 0a  
01 00 47  
01 01 01  
01 01 70  
01 02 fd  
01 76 f1  
01 84 6b  
01 84 a1  
01 85 4f

Because you have no /Index, this data is interpreted as object numbers 0 through 9, with the following offsets:

0 is unused.  Fine.  
1 is at 0x0a.  Yep, sure is  
2 is at 0x47.  Nope.  That lands near the beginning of "1 0"'s stream. This probably isn't a coincidence.  
3 is at 0x101.  Nope.  0x101 is still within "1 0"'s stream.  
4 is at 0x170.  Ditto  
5 is at 0x2fd.  Ditto  
6 is at 0x76f1. Nope, and this time buried inside that image's stream.

I think you get the idea. So even if you had a correct /Index, your offsets are all wrong (and completely different from what's in resultNormal.pdf, even allowing for dec-hex confusion).
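To make the reading above reproducible, here is a minimal Python sketch that splits the decoded bytes into entries according to /W and assigns object numbers from /Index. The function name `parse_xref_rows` is made up for illustration, and the input is simply the ten rows quoted above with the default /Index of [0 10]; it is not a full xref parser.

```python
# The ten decoded rows quoted above, 3 bytes each (/W [1 2 0]).
RAW = bytes.fromhex(
    "000000" "01000a" "010047" "010101" "010170"
    "0102fd" "0176f1" "01846b" "0184a1" "01854f"
)

def parse_xref_rows(data, widths, index):
    """Return (object_number, type, field2, field3) tuples.

    widths is the /W array, index is the /Index array
    (pairs of first object number and entry count).
    """
    entries = []
    pos = 0
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            fields = []
            for i, w in enumerate(widths):
                if w == 0:
                    # A zero-width field takes its default:
                    # type 1 for the first field, 0 for the others.
                    fields.append(1 if i == 0 else 0)
                else:
                    fields.append(int.from_bytes(data[pos:pos + w], "big"))
                pos += w
            entries.append((obj_num, *fields))
    return entries

entries = parse_xref_rows(RAW, [1, 2, 0], [0, 10])
for obj_num, kind, offset, gen in entries:
    print(obj_num, kind, hex(offset), gen)
```

Running it against a different /Index (for example the one suggested further down) only changes which object number each row is attributed to; the offsets themselves still have to be right.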

What you want can be found in resultNormal's xref:

xref  
0 2  
0000000000 65535 f  
0000000010 00000 n  
5 2  
0000003460 00000 n  
0000003514 00000 n  
8 5  
0000003688 00000 n  
0000003749 00000 n  
0000003935 00000 n  
0000004046 00000 n  
0000004443 00000 n  

So your index should be (if I'm reading this right): /Index [0 2 5 2 8 5]. And the data:

0 0 0  
1 0 a  
1 3460 (that's decimal)  
1 3514 (ditto)  
1 3688  
etc
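A rough Python sketch of how the /Index array and the entry data could be derived from the offsets in resultNormal's xref table. The helper names `build_index` and `pack_entries` are hypothetical, and I'm assuming /W [1 2 2] here (rather than the [1 2 0] in the question) so that the free entry for object 0 can carry generation 65535.

```python
def build_index(obj_nums):
    """Group object numbers into [first, count, first, count, ...]."""
    index = []
    for n in sorted(obj_nums):
        if index and index[-2] + index[-1] == n:
            index[-1] += 1       # n extends the current subsection
        else:
            index += [n, 1]      # n starts a new subsection
    return index

def pack_entries(offsets, widths=(1, 2, 2)):
    """Pack one big-endian row per object, in ascending object order."""
    out = bytearray()
    for n in sorted(offsets):
        if n == 0:
            fields = (0, 0, 65535)       # type 0: head of the free list
        else:
            fields = (1, offsets[n], 0)  # type 1: byte offset, generation 0
        for value, width in zip(fields, widths):
            out += value.to_bytes(width, "big")
    return bytes(out)

# Object number -> byte offset, copied from the xref table above.
OFFSETS = {0: 0, 1: 10, 5: 3460, 6: 3514, 8: 3688,
           9: 3749, 10: 3935, 11: 4046, 12: 4443}

index = build_index(OFFSETS)
data = pack_entries(OFFSETS)
print(index)
print(data.hex(" ", 5))
```

A two-byte offset field only works while every offset is below 65536; a writer would normally pick the width from the largest offset it has to store.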

Interestingly, the PDF spec says that the size must be BOTH the number of entries in this and all previous XRefs AND the number one higher than the highest object number in use.

I don't think the latter part is ever enforced, but I wouldn't be surprised to find that xref streams are stricter than the normal cross-reference tables. Might be the same code handling both, might not.


@mtraut:

Here's what I see:

13 0 obj <</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
stream  
...  
endstream  
endobj  
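
The dictionary above asks for /FlateDecode with /Predictor 12 and /Columns 3, so the stream bytes are not the plain entry rows: each 3-byte row is prefixed with a PNG filter-type byte (2 = Up) and stored as a difference from the row above before being deflated. Here is a small Python sketch of that round trip. The function names are made up, the sample rows are the first three from the hex dump earlier, and the decoder only handles filter types 0 and 2.

```python
import zlib

def png_up_encode(data, columns):
    """Apply the PNG 'Up' filter row by row (what /Predictor 12 implies)."""
    out = bytearray()
    prev = bytes(columns)                 # the row above the first is all zeros
    for i in range(0, len(data), columns):
        row = data[i:i + columns]
        out.append(2)                     # filter type 2 = Up
        out += bytes((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return bytes(out)

def png_decode(data, columns):
    """Undo PNG row filtering; only None (0) and Up (2) are supported here."""
    out = bytearray()
    prev = bytes(columns)
    stride = columns + 1                  # one filter-type byte per row
    for i in range(0, len(data), stride):
        ftype = data[i]
        row = data[i + 1:i + stride]
        if ftype == 0:
            cur = bytes(row)
        elif ftype == 2:
            cur = bytes((b + p) & 0xFF for b, p in zip(row, prev))
        else:
            raise ValueError(f"unsupported PNG filter type {ftype}")
        out += cur
        prev = cur
    return bytes(out)

rows = bytes.fromhex("000000" "01000a" "010047")
stream_body = zlib.compress(png_up_encode(rows, 3))
decoded = png_decode(zlib.decompress(stream_body), 3)
print(decoded.hex(" ", 3))
```

If the writer skips the predictor step, the /DecodeParms entry has to go as well, otherwise a reader will undo a filter that was never applied.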
剑心龙吟 2024-10-16 23:02:21

The "resultstream.pdf" does not have a valid cross-reference stream.

If I open it in my viewer, it tries to read object "13 0" as a cross-reference stream, but it's a plain dictionary (the stream keywords and data are missing).

A little off topic: what language are you developing in? At least in Java I know of three valuable choices (PDFBox, iText and jPod, where I personally, as one of the developers, opt for jPod, very clean implementation :-). If this does not fit your platform, maybe you can at least have a look at the algorithms and data structures.

EDIT

Well - if "resultstream.pdf" is the document in question, then this is what my editor (SciTE) sees:

...
13 0 obj
<</Size 0/W [1 2 0]/Type /XRef/Root 8 0 R>>
endobj
startxref
34127
%%EOF

There is no stream.
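
For comparison, a complete cross-reference stream object needs the dictionary, the `stream` keyword, the (here Flate-compressed) data, and `endstream` before `endobj`. The Python sketch below shows one way to serialize that. The function `xref_stream_object` is hypothetical, the /Size and /Index values follow the other answer, the offsets are the ones from resultNormal's xref table, and no predictor is used to keep it short.

```python
import zlib

def xref_stream_object(obj_num, size, index, widths, root_ref, data):
    """Serialize an indirect object holding a cross-reference stream."""
    body = zlib.compress(data)
    dictionary = (
        f"<</Type /XRef/Size {size}"
        f"/Index [{' '.join(map(str, index))}]"
        f"/W [{' '.join(map(str, widths))}]"
        f"/Root {root_ref}/Filter /FlateDecode/Length {len(body)}>>"
    ).encode("ascii")
    return (
        f"{obj_num} 0 obj\n".encode("ascii")
        + dictionary
        + b"\nstream\n" + body + b"\nendstream\nendobj\n"
    )

# Nine rows for /W [1 2 0]: the free entry for object 0, then type-1 rows.
in_use_offsets = [10, 3460, 3514, 3688, 3749, 3935, 4046, 4443]
data = b"\x00\x00\x00" + b"".join(
    b"\x01" + off.to_bytes(2, "big") for off in in_use_offsets
)

obj = xref_stream_object(13, 14, [0, 2, 5, 2, 8, 5], [1, 2, 0], "8 0 R", data)
print(obj[:obj.index(b"stream")].decode("ascii"))
```

The `startxref` value at the end of the file then has to be the byte offset of the `13 0 obj` line itself.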
