pyPdf for IndirectObject extraction

Posted on 2024-07-11 20:19:19

Following this example, I can list all the objects in a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))  # PDFs are binary, so open in "rb" mode
list(pdf.pages)  # Iterating over the pages makes pyPdf resolve the objects they reference.
print(pdf.resolvedObjects)

Now I need to extract a non-standard object from the PDF file.

My object is the one named MYOBJECT, and it is a string.

The part of the script's output that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The relevant part of the PDF file looks like this:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the 584 reference to get at my string (using pyPdf, of course)?

Comments (3)

花海 2024-07-18 20:19:19

Each element in pdf.pages is a dictionary, so assuming it's on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing it on its own, or poke at it with help() and dir() at a Python prompt to learn more about how to get the string you want.

Edit:

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and its value can be retrieved via getData().
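For intuition about what getData() hands back here, a rough sketch in plain Python (no pyPdf) that pulls the payload of an uncompressed stream object straight out of the file bytes. The SAMPLE bytes and the read_raw_stream helper are made up for illustration, modelled on object 584 from the question. It assumes a direct integer /Length, no /Filter, and no object streams, so treat it as a picture of the file layout rather than a parser.

```python
import re

# Hypothetical hand-written fragment modelled on object 584 from the question.
SAMPLE = b"584 0 obj\n<</Length 8>>stream\n1_22_4_1\nendstream\nendobj\n"

def read_raw_stream(pdf_bytes, obj_num, gen=0):
    """Return the raw bytes of an uncompressed stream object, or None if not found."""
    # Match "<num> <gen> obj << ... /Length N ... >> stream<EOL>".
    pattern = r"(?<!\d)%d %d obj\s*<<.*?/Length\s+(\d+).*?>>\s*stream\r?\n" % (obj_num, gen)
    match = re.search(pattern.encode("ascii"), pdf_bytes, re.DOTALL)
    if match is None:
        return None
    length = int(match.group(1))   # the /Length entry gives the payload size in bytes
    start = match.end()            # the payload starts right after the EOL following "stream"
    return pdf_bytes[start:start + length]

data = read_raw_stream(SAMPLE, 584)
print(data)
```

The /Length of 8 matches the eight bytes of 1_22_4_1, which is why the slice stops exactly at the end of the string.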

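The following function gives a more generic way to solve this by recursively searching for the key in question:

import pyPdf

pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))  # open the PDF in binary mode
pages = list(pdf.pages)  # iterating over the pages populates pdf.resolvedObjects

def findInDict(needle, haystack):
    """Return the first value stored under `needle` in nested dictionaries, or None."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            continue  # skip entries that cannot be read
        if key == needle:
            return value
        # Descend into plain dicts and pyPdf dictionary objects alike.
        if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None  # key not found anywhere below this dictionary

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()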
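One thing to watch with a recursive search like this: PDF dictionaries can point back at each other (a page has a /Parent, for instance), so on some files the recursion might never finish. Below is a hedged sketch of the same idea with a visited set. find_key and the sample dictionaries are hypothetical and use plain dicts instead of pyPdf objects, and the /First entry is an artificial back-reference added only to create a cycle.

```python
def find_key(needle, haystack, _seen=None):
    """Depth-first search for `needle` in nested dicts; returns the first value found or None."""
    if _seen is None:
        _seen = set()
    if id(haystack) in _seen:      # this dictionary was already visited: stop here
        return None
    _seen.add(id(haystack))
    for key in list(haystack.keys()):
        value = haystack[key]
        if key == needle:
            return value
        if isinstance(value, dict):
            found = find_key(needle, value, _seen)
            if found is not None:
                return found
    return None

# Hypothetical stand-in for pdf.resolvedObjects, with a string in place of the stream.
page = {"/Type": "/Page",
        "/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": "stream-584"}}}}
pages = {"/Type": "/Pages", "/First": page}   # artificial back-reference
page["/Parent"] = pages                       # page -> pages -> page forms a cycle
resolved = {0: {558: page, 29: pages}}

print(find_key("/MYOBJECT", resolved))
print(find_key("/NOSUCHKEY", resolved))
```

The second call is the interesting one: without the visited set, a search for a key that does not exist would keep bouncing between page and pages.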
染火枫林 2024-07-18 20:19:19

An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.
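To make the "link or alias" idea concrete, here is a toy model in plain Python. It is not pyPdf's implementation: the Ref class, its get_object method, and the objects table are invented for illustration, with the object numbers borrowed from the question.

```python
class Ref(object):
    """Toy stand-in for an indirect reference such as `584 0 R`."""

    def __init__(self, num, gen, table):
        self.num, self.gen, self.table = num, gen, table

    def get_object(self):
        # Look the target up in the shared object table, keyed by (number, generation).
        return self.table[(self.num, self.gen)]

    def __repr__(self):
        return "Ref(%d, %d)" % (self.num, self.gen)

# Object table: the page dictionary only stores a reference to object 584.
objects = {}
objects[(584, 0)] = "1_22_4_1"
objects[(558, 0)] = {"/Properties": {"/MC0": {"/MYOBJECT": Ref(584, 0, objects)}}}

ref = objects[(558, 0)]["/Properties"]["/MC0"]["/MYOBJECT"]
print(repr(ref))          # the reference itself
print(ref.get_object())   # the object it points at
```

The dictionary never holds the string itself, only the (584, 0) pair, which is why printing it shows a reference until you ask for the object behind it.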

If the object is a text object, then calling str() or unicode() on it should give you the data inside it.

Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object:

13 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj

can be read like this:

>>> import pyPdf
>>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
>>> pages = list(pdf.pages)
>>> pdf.resolvedObjects
{0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
>>> pdf.resolvedObjects[0][13]
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}
一页 2024-07-18 20:19:19

Jehiah's method (the recursive search in the first answer) is good if you have to look everywhere for the object. My guess, from looking at the PDF, is that it is always in the same place (the first page, in the 'MC0' property), so a much simpler way to find the string would be:

import pyPdf
pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))  # open the PDF in binary mode
value = pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()
print(value)
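Since "always in the same place" is only a guess, it may be worth guarding the lookup so that a file laid out differently gives None instead of a KeyError. A small sketch, using a hypothetical dig helper and a plain-dict stand-in for the page:

```python
def dig(mapping, *keys):
    """Follow `keys` through nested mappings; return None if any step is missing."""
    current = mapping
    for key in keys:
        try:
            current = current[key]
        except (KeyError, TypeError, IndexError):
            return None
    return current

# Hypothetical stand-in for pdf.getPage(0), with a string in place of the stream.
page = {"/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": "stream-584"}}}}
path = ("/Resources", "/Properties", "/MC0", "/MYOBJECT")

found = dig(page, *path)
missing = dig({"/Resources": {}}, *path)
print(found)
print(missing)
```

With pyPdf the same call would presumably be dig(pdf.getPage(0), *path), followed by getData() only when the result is not None.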