Using PDFBox to get text colour
I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.
Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?
Many thanks.
Comments (1)
All color information should be stored in the class PDGraphicsState, and which color is used (stroking, non-stroking, etc.) depends on the text rendering mode in effect (via the pdfbox mailing list).

Here is a small sample I tried:

After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:

Here's the code:
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.

As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter the appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace, as defined by the mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor, and so on.

In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when the color changes. You could also write your own mappings in your own .properties file.
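To make the operator-mapping idea more concrete, here is a rough, self-contained sketch of the mechanism in plain Java. It is a toy model and does not use PDFBox at all: ColorTrackingSketch, State, operator() and showText() are names made up for illustration, and the handler map only plays the role that the .properties mappings play in the real engine. It assumes RGB operands in the 0..1 range (as PDF color operators use) and simplifies the rendering modes to "mode 1 uses the stroking color, everything else uses the non-stroking color".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy model only: none of these types are PDFBox classes.
class ColorTrackingSketch {

    /** Minimal stand-in for a graphics state. */
    static final class State {
        float[] strokingColor = {0f, 0f, 0f};
        float[] nonStrokingColor = {0f, 0f, 0f};
        int textRenderingMode = 0; // 0 = fill, 1 = stroke
    }

    private final State state = new State();
    private final Map<String, BiConsumer<State, float[]>> handlers = new HashMap<>();
    private final List<String> events = new ArrayList<>();

    ColorTrackingSketch() {
        // Plays the role of the operator-to-class mappings in a .properties file.
        handlers.put("RG", (s, operands) -> s.strokingColor = operands.clone());
        handlers.put("rg", (s, operands) -> s.nonStrokingColor = operands.clone());
        handlers.put("Tr", (s, operands) -> s.textRenderingMode = (int) operands[0]);
    }

    /** Dispatches one operator to its handler; unmapped operators are ignored. */
    void operator(String name, float... operands) {
        BiConsumer<State, float[]> handler = handlers.get(name);
        if (handler != null) {
            handler.accept(state, operands);
        }
    }

    /** Records the text together with the color the rendering mode selects. */
    void showText(String text) {
        // Simplification: mode 1 (stroke) picks the stroking color, all other modes the fill color.
        float[] c = state.textRenderingMode == 1 ? state.strokingColor : state.nonStrokingColor;
        int[] rgb = new int[c.length];
        for (int i = 0; i < c.length; i++) {
            rgb[i] = Math.round(c[i] * 255f); // components are stored as 0..1
        }
        events.add(text + " -> " + Arrays.toString(rgb));
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        ColorTrackingSketch engine = new ColorTrackingSketch();
        engine.operator("rg", 146f / 255f, 208f / 255f, 80f / 255f); // fill color
        engine.showText("Sample");
        engine.operator("Tr", 1f);          // switch to stroked text
        engine.operator("RG", 1f, 0f, 0f);  // stroking color
        engine.showText("Outlined");
        engine.events().forEach(System.out::println);
    }
}
```

The point of the sketch is only that the color has to be read at the moment the text is shown, not once after the whole stream has been processed. A subclass of the real engine would presumably hook in at the equivalent place.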