pyPdf output file is the same size regardless of the number of pages

Posted on 2025-01-06 18:35:04

I'm trying to use pyPdf to extract a few pages from a large PDF into a separate file. Whenever I do, the resulting file is nearly the same size as the source file. I think it has something to do with the bookmarks inside the file, because the output file is very small if the pages don't contain any links. I can't figure out how to exclude the bookmarks from the output file.

from pyPdf import PdfFileWriter as writer, PdfFileReader as reader

w = writer()
r = reader(open('9.pdf', 'rb'))  # PDFs are binary, so open in 'rb' mode

# Copy the first five pages into the new document
for p in xrange(5):
    w.addPage(r.getPage(p))

with open('out.pdf', 'wb') as stream:
    w.write(stream)

w._objects
# prints:

{'/Kids': [IndirectObject(4, 0), IndirectObject(5, 0), IndirectObject(6, 0), IndirectObject(7, 0), IndirectObject(8, 0)], '/Type': '/Pages', '/Count': 5}
{'/Producer': u'Python PDF Library - http://pybrary.net/pyPdf/'}
{'/Type': '/Catalog', '/Pages': IndirectObject(1, 0)}
{'/Parent': IndirectObject(1, 0), '/Rotate': 0, '/Contents': IndirectObject(4307, 0), '/Resources': {'/ColorSpace': {'/CS1': IndirectObject(4309, 0), '/CS0': IndirectObject(4305, 0)}, '/XObject': {'/Im0': IndirectObject(4312, 0)}, '/ExtGState': {'/GS2': IndirectObject(4324, 0), '/GS1': IndirectObject(4323, 0), '/GS0': IndirectObject(4306, 0)}, '/Font': {'/T1_2': IndirectObject(4308, 0), '/T1_0': IndirectObject(4303, 0), '/T1_1': IndirectObject(4304, 0)}, '/ProcSet': ['/PDF', '/Text', '/ImageB']}, '/CropBox': [0, 0, 612, 792], '/BCLPrivAnnots': {'/BCLC_BCL_Jade': []}, '/MediaBox': [0, 0, 612, 792], '/Annots': IndirectObject(4301, 0), '/Type': '/Page'}
{'/Parent': IndirectObject(1, 0), '/Contents': IndirectObject(2, 0), '/Resources': {'/ColorSpace': {'/CS1': IndirectObject(4309, 0), '/CS0': IndirectObject(4305, 0)}, '/ExtGState': {'/GS2': IndirectObject(3417, 0), '/GS1': IndirectObject(3412, 0), '/GS0': IndirectObject(4306, 0)}, '/Font': {'/T1_2': IndirectObject(3413, 0), '/T1_0': IndirectObject(3415, 0), '/T1_1': IndirectObject(3416, 0)}, '/ProcSet': ['/PDF', '/Text']}, '/Rotate': 0, '/CropBox': [0, 0, 612, 792], '/BCLPrivAnnots': {'/BCLC_BCL_Jade': []}, '/MediaBox': [0, 0, 612, 792], '/Thumb': IndirectObject(3920, 0), '/Type': '/Page'}
{'/Parent': IndirectObject(1, 0), '/Contents': IndirectObject(4, 0), '/Resources': {'/ColorSpace': {'/CS0': IndirectObject(4305, 0)}, '/ExtGState': {'/GS0': IndirectObject(4306, 0)}, '/Font': {'/T1_2': IndirectObject(3425, 0), '/T1_3': IndirectObject(3428, 0), '/T1_0': IndirectObject(3426, 0), '/T1_1': IndirectObject(3427, 0)}, '/ProcSet': ['/PDF', '/Text']}, '/Rotate': 0, '/CropBox': [0, 0, 612, 792], '/BCLPrivAnnots': {'/BCLC_BCL_Jade': []}, '/MediaBox': [0, 0, 612, 792], '/Thumb': IndirectObject(3921, 0), '/Type': '/Page'}
{'/Parent': IndirectObject(1, 0), '/Contents': IndirectObject(6, 0), '/Resources': {}, '/Rotate': 0, '/CropBox': [0, 0, 612, 792], '/BCLPrivAnnots': {'/BCLC_BCL_Jade': []}, '/MediaBox': [0, 0, 612, 792], '/Thumb': IndirectObject(3922, 0), '/Type': '/Page'}
{'/Parent': IndirectObject(1, 0), '/Contents': IndirectObject(9, 0), '/Resources': IndirectObject(8, 0), '/Rotate': 0, '/CropBox': [0, 0, 612, 792], '/BCLPrivAnnots': {'/BCLC_BCL_Jade': []}, '/MediaBox': [0, 0, 612, 792], '/Thumb': IndirectObject(3923, 0), '/Type': '/Page'}
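
One plausible reading of the dump above is that the first page's '/Annots' entry is what drags the rest of the document along: a PDF writer generally has to copy every object it can reach from the pages it was given, and a link annotation can point at another page. The sketch below is a toy model of that reachability walk, not pyPdf's actual code. `Ref`, `reachable`, and the `objects` table are hypothetical names invented for illustration, and the object numbers are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for an indirect reference: just an object number.
Ref = namedtuple('Ref', 'num')


def reachable(objects, start):
    """Return the set of object numbers reachable from `start`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if isinstance(node, Ref):
            # Follow each reference once, then look inside its target.
            if node.num in seen:
                continue
            seen.add(node.num)
            stack.append(objects[node.num])
        elif isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, (list, tuple)):
            stack.extend(node)
        # Anything else (names, numbers, strings) is a leaf.
    return seen


# Made-up object table: page 1 has a link to page 4, page 6 has no links.
objects = {
    1: {'/Type': '/Page', '/Contents': Ref(2), '/Annots': Ref(3)},
    2: 'content stream of the first page',
    3: [{'/Subtype': '/Link', '/Dest': [Ref(4), '/Fit']}],
    4: {'/Type': '/Page', '/Contents': Ref(5)},
    5: 'content stream of the linked page',
    6: {'/Type': '/Page', '/Contents': Ref(7)},
    7: 'content stream of a page without links',
}

print(sorted(reachable(objects, Ref(1))))  # [1, 2, 3, 4, 5]
print(sorted(reachable(objects, Ref(6))))  # [6, 7]
```

In this model the page with a link pulls in the page it points to, while the page without links brings only its own content. In a real file the linked page would also carry its original '/Parent' reference, which could in turn reach every other page, and that would match the observation that only pages with links produce a large output.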

Comments (1)

寂寞清仓 2025-01-13 18:35:04

When using pyPdf, the output file is stripped of almost all formatting.

Specifically, replace:

with open('out.pdf', 'wb') as stream:
    w.write(stream)

with:

stream = file('out.pdf', 'wb')
w.write(stream)
stream.close()

Then look at the end result.
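
To judge whether a change like this makes any difference, it helps to compare the two file sizes directly instead of eyeballing them. The helper below is a minimal sketch using only the standard library; `size_ratio` is a hypothetical name, and the file names in the usage comment are the ones from the question.

```python
import os


def size_ratio(source_path, output_path):
    """Return the output file size divided by the source file size."""
    source_size = os.path.getsize(source_path)
    if source_size == 0:
        raise ValueError('source file is empty')
    return os.path.getsize(output_path) / float(source_size)


# Example use with the question's files (assumes both exist on disk):
#     print(size_ratio('9.pdf', 'out.pdf'))
```

A ratio close to 1.0 means the extracted pages still carry nearly the whole source document with them.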

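It is also good practice to keep a handle on the input file and close it explicitly, but only after the output has been written, because pyPdf reads page data from the open file when w.write() runs:

fin = open('9.pdf', 'rb')
r = reader(fin)
# ... the addPage() calls and w.write(stream) go here ...
fin.close()

rather than:

r = reader(open('9.pdf', 'rb'))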
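
If the extra size does come from entries such as '/Annots' or '/Thumb' on the copied pages, one thing to try is removing those entries from each page before adding it to the writer. The sketch below assumes that pyPdf page objects behave like ordinary dictionaries and allow entries to be deleted; that assumption is untested here, and `drop_page_keys` is a hypothetical helper, not part of pyPdf. Deleting '/Annots' also removes the clickable links from the extracted pages.

```python
def drop_page_keys(page, keys=('/Annots', '/Thumb')):
    """Remove the given entries from a dict-like page object, in place."""
    for key in keys:
        if key in page:
            del page[key]
    return page


# Intended use with the question's code (assumption: pyPdf pages are
# dict-like and support deleting entries):
#
#     for p in xrange(5):
#         w.addPage(drop_page_keys(r.getPage(p)))

# Stand-in page built from a plain dict, shaped like the question's dump.
page = {
    '/Type': '/Page',
    '/Contents': 'content',
    '/Annots': 'links',
    '/Thumb': 'thumbnail',
}
drop_page_keys(page)
print(sorted(page))  # ['/Contents', '/Type']
```

Comparing the size of out.pdf before and after this change would show whether the annotations were in fact responsible.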