How can I read the properties of the symbol dictionaries used by the JBIG2 algorithm in my PDF?

Posted on 2025-01-31 21:41:00

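I have a PDF containing a long list of numbers, which is compressed using the JBIG2 algorithm. When I look at the internal structure of the file, I can find my page, which is built from two different XObjects.

(The image is from Adobe Acrobat Preflight -> Internal Structure.)

I can easily see the details of the first one, called "Xiplayer0", and extract its information bit by bit if I want. The second one is the one I'm interested in, though. In it, I can see that the image is built using 2 "symbol dictionaries" (the first one is marked in gray). Is it possible to see the different entries in these dictionaries? Or even just to get some metadata for one of them?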

凡间太子 2025-02-07 21:41:00

This is not really about PDF. PDF is just the container for the JBIG2 format and its symbol dictionaries, which are what you're really interested in.

But, as a first step, you'll need to get the JBIG2 images out of the PDF:

Extract images from PDF, how to handle JBIG2 encoded

That Stack Overflow question mentions Poppler, and Poppler does have a Python binding/wrapper:

https://pypi.org/project/python-poppler/

Once you get those JBIG2 files, maybe this can help:

jbig2_symbol_dict.c

The bigger project, jbig2dec, has a command-line utility with a "dump" option, but the source says it is not implemented (https://github.com/artifexsoftware/jbig2dec/blob/master/jbig2dec.c#l604):

case dump:
    fprintf(stderr, "Sorry, segment dump not yet implemented\n");
    break;

So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
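
Even without a working dump option, some header-level metadata can probably be read straight from the extracted streams. Below is a minimal Python sketch, not part of jbig2dec, that assumes the segment header and symbol dictionary layouts as described in the JBIG2 specification (ITU-T T.88, sections 7.2 and 7.4.2) and assumes the input is a raw embedded stream without a JBIG2 file header, such as the globals stream stored in a PDF. The function names are hypothetical. It reports only the flags and symbol counts; it does not decode the symbol bitmaps.

```python
import struct

SYMBOL_DICTIONARY = 0  # segment type value of a symbol dictionary segment


def parse_symbol_dict_header(body):
    """Read the fixed fields at the start of a symbol dictionary segment."""
    (flags,) = struct.unpack_from(">H", body, 0)
    huffman = bool(flags & 0x0001)             # SDHUFF
    refagg = bool(flags & 0x0002)              # SDREFAGG
    template = (flags >> 10) & 0x3             # SDTEMPLATE
    refinement_template = (flags >> 12) & 0x1  # SDRTEMPLATE
    offset = 2
    if not huffman:
        # Adaptive template pixel coordinates: 8 bytes for template 0, else 2.
        offset += 8 if template == 0 else 2
    if refagg and refinement_template == 0:
        offset += 4  # refinement adaptive template coordinates
    exported, new = struct.unpack_from(">II", body, offset)
    return {
        "huffman": huffman,
        "refinement_aggregate": refagg,
        "template": template,
        "exported_symbols": exported,  # SDNUMEXSYMS
        "new_symbols": new,            # SDNUMNEWSYMS
    }


def list_symbol_dictionaries(data):
    """Walk sequential segments and collect symbol dictionary metadata."""
    found = []
    pos = 0
    try:
        while pos < len(data):
            number, flags, rts = struct.unpack_from(">IBB", data, pos)
            segment_type = flags & 0x3F
            page_assoc_size = 4 if flags & 0x40 else 1
            if rts >> 5 == 7:
                # Long form: 29-bit count, then ceil((count + 1) / 8) flag bytes.
                count = struct.unpack_from(">I", data, pos + 5)[0] & 0x1FFFFFFF
                pos += 5 + 4 + (count + 8) // 8
            else:
                count = rts >> 5
                pos += 6
            # Width of each referred-to segment number depends on this segment's number.
            ref_size = 1 if number <= 256 else 2 if number <= 65536 else 4
            pos += count * ref_size + page_assoc_size
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            if length == 0xFFFFFFFF:
                break  # unknown data length; this sketch does not handle it
            if segment_type == SYMBOL_DICTIONARY:
                info = parse_symbol_dict_header(data[pos:pos + length])
                info["segment"] = number
                info["data_length"] = length
                found.append(info)
            pos += length
    except struct.error:
        pass  # truncated stream: return what was read so far
    return found
```

Called on the bytes of an extracted globals stream, for example `list_symbol_dictionaries(open(path, "rb").read())`, this should return one dictionary of fields per symbol dictionary segment. Seeing the individual symbols would still need a real decoder.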

猫七 2025-02-07 21:41:00

The file in question has a known problem. A JBIG2 scan is supposed to be a highly compressed, clean pixel scan without some of the issues that a low-quality JPEG may introduce. However, the format as used by some commercial scanners can notoriously infill a 6 so that it looks like an 8, as seen in this sequence from page 1. See https://en.wikipedia.org/wiki/JBIG2#Disadvantages

For several reasons, some organisations suggest that it not be used for critical documents, where image fidelity needs to match that of more conventional monochrome TIFF, GIF or PNG scans.

Extracting such an image takes 2 command lines using 2 libraries (Poppler and jbig2dec):

poppler\bin>pdfimages -all 7535-7pt.pdf out

and then a for loop over the outputs (in this case 001 to 81, from the 243 output files) running a command similar to

jbig2\Library\bin>jbig2dec -o out-001 -t pbm out-001.jb2g out-001.jb2e
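
The for loop could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name is hypothetical, jbig2dec is assumed to be on the PATH, and only images that have both a .jb2g and a .jb2e file are handled (others are skipped). The command it builds mirrors the one above.

```python
import subprocess
from pathlib import Path


def build_jbig2dec_commands(directory, jbig2dec="jbig2dec"):
    """Build one jbig2dec command per .jb2g/.jb2e pair found in directory."""
    commands = []
    for embedded in sorted(Path(directory).glob("*.jb2e")):
        global_stream = embedded.with_suffix(".jb2g")
        if not global_stream.exists():
            continue  # no matching globals file; skipped in this sketch
        output = embedded.with_suffix("")  # e.g. out-001, as in the command above
        commands.append([
            jbig2dec, "-o", str(output), "-t", "pbm",
            str(global_stream), str(embedded),
        ])
    return commands


def run_all(directory):
    """Run every command and stop at the first failure."""
    for command in build_jbig2dec_commands(directory):
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to print and check them before converting all the files.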

Metadata for the first 3 pages can be seen here (a poor 200 dpi equivalent had been used):

23.01.0\Library\bin>pdfimages -list 7535-7pt.pdf  

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1184   832  gray    1   8  jpeg   no         6  0   100   100  554B 0.1%
   1     1 stencil  1967  1230  -       1   1  jbig2  no         8  0   200   200 7885B 2.6%
   2     2 image    1184   832  gray    1   8  jpeg   no        13  0   100   100  573B 0.1%
   2     3 stencil  1966  1200  -       1   1  jbig2  no        15  0   200   200 7415B 2.5%
   3     4 image    1184   832  gray    1   8  jpeg   no        19  0   100   100  552B 0.1%
   3     5 stencil  1967  1201  -       1   1  jbig2  no        21  0   200   200 7829B 2.7%
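
To pick out only the JBIG2 entries from a listing like this, a small Python sketch such as the one below may help. It assumes the 16-column layout shown above (with "object ID" counted as two columns), and the function name is hypothetical.

```python
def jbig2_images(listing):
    """Return the rows of a `pdfimages -list` listing whose encoding is jbig2."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 fields and start with a page number.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        if fields[8] != "jbig2":
            continue
        rows.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
            "object": int(fields[10]),
            "x_ppi": int(fields[12]),
            "y_ppi": int(fields[13]),
            "size": fields[14],
        })
    return rows
```

The listing text itself can be captured with `subprocess.run(["pdfimages", "-list", pdf_path], capture_output=True, text=True).stdout`.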

The 81 PBMs will be faithful copies of the poor, variable inputs, which typically look like this:

/MediaBox [0 0 842 596] /Rotate 270 
/Image
/BitsPerComponent 1
/Width 1967
/Height 1230
/ImageMask true
/Filter
/JBIG2Decode

The old 243 images can then be discarded (the PDF file should have been discarded anyway, and the paper source rescanned at a higher resolution), as the images are of no use except to show the errors described above.
