Reading font colour information from a PDF
I am working on a piece of software that analyses PDF files and generates HTML based on them. A number of tools already do this, so I know it is possible, but I have to write my own for business reasons.
I have managed to get all the text information, positions and fonts out of the PDF, but I am struggling to read out the colour of the text. I am currently using PDFMiner to analyse the PDF, but I am beginning to think I will need to write my own PDF reader. Even so, I can't figure out where in the document the colour information for text is kept! I have even read the PDF spec but cannot find the information I need.
I have scoured Google, with no joy.
Thanks in advance!
Comments (1)
The colour for text and other filled graphics is set using one of the g, rg or k operators (grey, RGB and CMYK respectively) in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual. The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
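To make the idea concrete, here is a minimal Python sketch of how the fill colour in effect could be tracked while walking an already-decoded content stream. It is not PDFMiner code: the tokenizer, the function name text_fill_colours and the output format are all made up for illustration. It assumes only g, rg and k set the fill colour (it ignores cs/sc/scn, text rendering modes that stroke, nested parentheses in strings and inline images), so treat it as a starting point rather than a complete parser.

```python
import re

# Simplified tokenizer for a decoded content stream (an assumption, not a
# full PDF lexer): names, literal strings, hex strings, numbers, operators.
TOKEN = re.compile(
    rb"/[^\s/\[\]()<>]*"            # name, e.g. /F1
    rb"|\((?:\\.|[^\\()])*\)"       # literal string without nested parens
    rb"|<[0-9A-Fa-f\s]*>"           # hex string
    rb"|[-+]?(?:\d+\.?\d*|\.\d+)"   # number
    rb"|[A-Za-z'\"][A-Za-z0-9*]*",  # operator, e.g. rg, Tj, T*
    re.S,
)

# Fill-colour operators and how many numeric operands each one takes.
FILL_OPS = {"g": ("gray", 1), "rg": ("rgb", 3), "k": ("cmyk", 4)}
TEXT_OPS = {"Tj", "TJ", "'", '"'}


def text_fill_colours(stream: bytes):
    """Return (colour_space, components, raw_text) for each text operator."""
    colour = ("gray", (0.0,))  # the initial fill colour is black
    saved = []                 # graphics-state stack for q / Q
    operands = []
    results = []
    for match in TOKEN.finditer(stream):
        tok = match.group()
        first = tok[:1]
        if first in b"/(<":
            operands.append(tok)
        elif first in b"+-.0123456789":
            operands.append(float(tok.decode("ascii")))
        else:
            op = tok.decode("ascii")
            nums = [x for x in operands if isinstance(x, float)]
            if op in FILL_OPS and len(nums) >= FILL_OPS[op][1]:
                space, count = FILL_OPS[op]
                colour = (space, tuple(nums[-count:]))
            elif op == "q":
                saved.append(colour)
            elif op == "Q" and saved:
                colour = saved.pop()
            elif op in TEXT_OPS:
                text = b"".join(
                    x for x in operands
                    if isinstance(x, bytes) and x[:1] in b"(<"
                )
                results.append((colour[0], colour[1], text))
            operands = []      # operands belong to one operator only
    return results
```

For a stream such as `1 0 0 rg (Red) Tj`, the sketch reports the text operand together with an rgb colour of (1.0, 0.0, 0.0). The q and Q operators save and restore the graphics state, which is why the colour is pushed and popped there.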
When parsing a PDF file yourself, you start by reading the trailer at the end of the file, which contains the file offset of the cross-reference table. This table contains the file offset of each object in the PDF file. The objects form a tree structure with references to other objects, and one of those objects will be the content stream. This is described in sections 3.4 File Structure and 3.6 Document Structure in the PDF reference manual.
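As a rough illustration of that first step, the Python sketch below looks for the startxref keyword near the end of the file and returns the offset that follows it. The function name find_xref_offset and the 1024-byte search window are assumptions made for this example. It only locates the table; it does not parse it, and newer files may store the cross-reference data in a stream rather than a plain table.

```python
import re


def find_xref_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = pdf_bytes[-1024:]  # the keyword sits close to the end of the file
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the file")
    return int(matches[-1])
```

Seeking to the returned offset should land on the xref keyword (or on a cross-reference stream object), from which the offsets of the individual objects can be read.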
It is possible to parse the PDF file yourself, but this is quite a lot of work. The content stream may be compressed, contain references to other objects, contain comments, etc., and you must handle all of these cases.
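The compression case, at least, is small when the stream uses the common FlateDecode filter, which is zlib/deflate data. The Python sketch below assumes you have already extracted the raw bytes between stream and endstream and read the /Filter name from the stream dictionary. The helper name decode_stream is hypothetical, and filter chains, predictors and the other filter types are deliberately left out.

```python
import zlib
from typing import Optional


def decode_stream(raw: bytes, filter_name: Optional[str]) -> bytes:
    """Decode raw stream bytes for the unfiltered and FlateDecode cases."""
    if filter_name is None:
        return raw                   # stream is stored uncompressed
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)  # FlateDecode is zlib/deflate data
    # Other filters (LZWDecode, ASCII85Decode, ...) are not handled here.
    raise NotImplementedError(f"unsupported filter: {filter_name}")
```

The decoded bytes are the operator sequence in which the colour operators appear.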
The PDFMiner software is already reading the content stream. Perhaps it would be easier to extend PDFMiner to report the colour of the text too?