PDF: Duplicate font names with different ToUnicode CMaps
I'm parsing a PDF file and extracting some of the text, and I've run into a font dictionary named "C2_0", which contains a CIDFont (Type 0) with a ToUnicode CMap. So, no problem: I have tools to parse the ToUnicode CMap and map the 2-byte character codes to Unicode values.
But the PDF file later includes another font dictionary object, also called "C2_0", which contains a different ToUnicode CMap. I didn't really know how I should handle the second CMap, so I just guessed and combined the entries from both CMaps. This actually worked, and the text was extracted correctly.
But, I can't find anything in the PDF Reference Manual that says this is allowed, or even addresses this situation. I would have thought that duplicate font names would lead to unspecified behavior, or at least have the second override the first or something. I only tried combining them as a longshot guess - and was surprised it actually worked.
Does anyone have experience with this? Does anyone know if a PDF is allowed to have duplicate font names that refer to different objects with different CMaps that "combine" when invoked by a Tf operator?
Comments (2)
C2_0 is a symbolic name in the /Font resource dictionary, and it has local scope: it is used only in the content stream that the resource dictionary belongs to. If C2_0 also appears in another /Font resource dictionary, that's not a problem.
If you have two C2_0 entries in the same /Font resource dictionary:
/C2_0 X 0 R
/C2_0 Y 0 R
then you have a problem because the behavior is undefined and it is up to you how to handle the situation.
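Since the outcome is up to the parser, one option is to detect the repeated key while building the dictionary and apply an explicit policy. The following is only a minimal sketch: `build_dict` is a hypothetical helper, the (key, value) pair list stands in for whatever your tokenizer produces, and "keep the first entry" is an arbitrary choice, not something the PDF specification mandates.

```python
def build_dict(pairs):
    """Build a dict from (key, value) pairs, reporting repeated keys."""
    result = {}
    duplicates = []
    for key, value in pairs:
        if key in result:
            # Policy assumed for this sketch: keep the first entry
            # and record the key so the caller can log or react.
            duplicates.append(key)
            continue
        result[key] = value
    return result, duplicates


# The two conflicting entries from the example above.
fonts, dupes = build_dict([("C2_0", "X 0 R"), ("C2_0", "Y 0 R")])
```

Returning the list of duplicates lets the caller decide whether to warn, fall back, or try each candidate font.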
The symbolic name resolution works like this: if you are in a page content stream, search for the font's symbolic name (the Tf operand) in the page's resource dictionary. If you cannot find it there, go up the page tree and search the resource dictionary (if one exists) of each parent page node. If you reach the top of the page tree without finding the font, the behavior is undefined. At that point you can implement various fallback strategies: you can use a default font, you can search the resources of the form XObjects on the page, or you can search the resource dictionaries of other pages.
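The lookup described above can be sketched roughly as follows. This is an illustration only: page tree nodes are modeled as plain Python dicts with optional "Resources" and "Parent" keys, font values are placeholder strings rather than parsed objects, and `resolve_font` is a hypothetical name.

```python
from typing import Optional


def resolve_font(page: dict, name: str) -> Optional[str]:
    """Walk from a page node up through its parents looking for a font name."""
    node = page
    while node is not None:
        fonts = node.get("Resources", {}).get("Font", {})
        if name in fonts:
            return fonts[name]
        node = node.get("Parent")  # None once we pass the root
    return None  # not found: the caller picks a fallback strategy


# A tiny hypothetical page tree: page1 defines its own C2_0,
# page2 has no resources and falls back to the root node.
root = {"Resources": {"Font": {"C2_0": "12 0 R"}}}
page1 = {"Parent": root, "Resources": {"Font": {"C2_0": "7 0 R"}}}
page2 = {"Parent": root}
```

Here the same name C2_0 resolves to a different font object depending on which page's content stream is being processed, which is why the name alone is not a safe key for a ToUnicode CMap.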
Unfortunately, you will find many PDF files in the wild that are not "perfect"...
What you describe can easily happen if you concatenate 2 PDF files into 1 with tools like pdftk.
Duplicate font names do not necessarily lead to unspecified behavior; it depends on the cleverness of the PDF reader. The PDF reader can take into account the object ID of each font when rendering the content... or mess it up by relying on font names only.
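For a text extractor, taking the object ID into account could mean caching each parsed ToUnicode map under the font's object reference instead of its resource name. The sketch below assumes references are represented as (object number, generation) tuples and that `load` is a caller-supplied function that parses the font's ToUnicode CMap; `FontCache` and both representations are hypothetical.

```python
class FontCache:
    """Cache parsed ToUnicode maps per font object, not per resource name."""

    def __init__(self):
        self._by_ref = {}

    def get(self, resources, name, load):
        # Resolve the name in the *current* resource dictionary first,
        # then key the cache on the object reference it points to.
        ref = resources["Font"][name]
        if ref not in self._by_ref:
            self._by_ref[ref] = load(ref)
        return self._by_ref[ref]


# Two pages that both use the name C2_0 for different font objects.
page_a = {"Font": {"C2_0": (7, 0)}}
page_b = {"Font": {"C2_0": (12, 0)}}


def fake_load(ref):
    # Stand-in for real CMap parsing; returns a distinct map per object.
    return {"source": ref}


cache = FontCache()
map_a = cache.get(page_a, "C2_0", fake_load)
map_b = cache.get(page_b, "C2_0", fake_load)
```

With this keying, the two C2_0 fonts never share or overwrite each other's mappings, so there is no need to merge their CMaps.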