How can I read the properties of the symbol dictionaries used by the JBIG2 algorithm in my PDF?
I have a PDF that contains a long list of numbers and was compressed using the JBIG2 algorithm.
When I look at the internal structure of the file, I can see that my pages are built from two different XObjects:
(Pictured is Adobe Acrobat Preflight -> Internal structure.)
I can easily look at the specifics of the first one, called "XIPLAYER0" (not pictured); it even gives me the information bit by bit if I want. The second one is the one I am interested in, though. In it I can see that the image is built using 2 "Symbol Dictionaries" (the first one is marked grey). Is it possible to see the different entries in these dictionaries? Or maybe even get some metadata for just one of them?
Comments (2)
This is not really about PDF. PDF is just the container for the JBIG2 format and its symbol dictionaries, which are what you're really interested in.
But, as a first step, you'll need to get the JBIG2 images out of the PDF:
Extract images from PDF, how to handle JBIG2 encoded
That SO question mentions Poppler, and Poppler does have a Python binding/wrapper:
https://pypi.org/project/python-poppler/
Once you get those JBIG2 files, maybe this can help:
jbig2_symbol_dict.c
The bigger project, jbig2dec, has a command-line utility with a "dump" option, but the source says it is not implemented (see https://github.com/ArtifexSoftware/jbig2dec/blob/master/jbig2dec.c#L604).
So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
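If some metadata per dictionary is enough, the segment headers can be read without a full decoder. The following is a minimal Python sketch, assuming the streams were already extracted from the PDF as raw, headerless JBIG2 data (the form PDF embeds) and that the byte layout matches my reading of the JBIG2 specification and of jbig2_symbol_dict.c. The names `Segment`, `iter_segments` and `symbol_dict_info` are made up for this example and do not belong to any library.

```python
import struct
from typing import Iterator, NamedTuple, Tuple

SYMBOL_DICTIONARY = 0  # segment type number for a symbol dictionary


class Segment(NamedTuple):
    number: int
    seg_type: int
    page: int
    referred_to: Tuple[int, ...]
    data: bytes


def iter_segments(stream: bytes) -> Iterator[Segment]:
    """Yield the segments of a headerless, sequentially organised JBIG2 stream."""
    pos = 0
    while pos + 11 <= len(stream):  # 11 bytes is the smallest possible header
        number, flags = struct.unpack_from(">IB", stream, pos)
        pos += 5
        seg_type = flags & 0x3F         # low 6 bits: segment type
        wide_page = bool(flags & 0x40)  # bit 6: 4-byte page association

        # Referred-to segment count: short form in the top 3 bits, or a
        # 4-byte long form (low 29 bits) when those three bits are all set.
        count = stream[pos] >> 5
        if count == 7:
            count = struct.unpack_from(">I", stream, pos)[0] & 0x1FFFFFFF
            pos += 4 + (count + 8) // 8  # count field plus retain-flag bytes
        else:
            pos += 1

        # The width of each referred-to number depends on this segment's number.
        width = 1 if number <= 256 else 2 if number <= 65536 else 4
        referred_to = tuple(
            int.from_bytes(stream[pos + i * width:pos + (i + 1) * width], "big")
            for i in range(count)
        )
        pos += count * width

        page_width = 4 if wide_page else 1
        page = int.from_bytes(stream[pos:pos + page_width], "big")
        pos += page_width

        (length,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        if length == 0xFFFFFFFF:  # unknown length: not handled in this sketch
            break
        yield Segment(number, seg_type, page, referred_to, stream[pos:pos + length])
        pos += length


def symbol_dict_info(data: bytes) -> dict:
    """Read the fixed fields at the start of a symbol dictionary segment."""
    flags = int.from_bytes(data[0:2], "big")
    huffman = bool(flags & 0x0001)
    refagg = bool(flags & 0x0002)
    template = (flags >> 10) & 3
    rtemplate = (flags >> 12) & 1
    pos = 2
    if not huffman:
        pos += 8 if template == 0 else 2  # adaptive template pixel bytes
    if refagg and rtemplate == 0:
        pos += 4                          # refinement template pixel bytes
    exported, new = struct.unpack_from(">II", data, pos)
    return {
        "huffman": huffman,
        "refinement_aggregate": refagg,
        "template": template,
        "exported_symbols": exported,
        "new_symbols": new,
    }
```

Feeding it the bytes of an extracted stream and calling `symbol_dict_info(seg.data)` for every segment whose `seg_type` equals `SYMBOL_DICTIONARY` would list each dictionary's coding mode and symbol counts. The glyph bitmaps themselves are arithmetic or Huffman coded, so listing the individual entries still needs a real decoder.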
The file in question has a known problem: a scan stored as JBIG2 is supposed to be a highly compressed, clean pixel scan without some of the issues that a JPEG may introduce at low quality. However, the format as used by some commercial scanners can notoriously infill a 6 to look like an 8, as seen in this sequence from page 1. See https://en.wikipedia.org/wiki/JBIG2#Disadvantages. For several reasons, some organisations suggest that it not be used for critical documents where image fidelity needs to match that of more conventional TIFF, GIF or PNG monochrome scans.
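To illustrate why a 6 can come out as an 8, here is a toy Python model of lossy symbol matching: a glyph that differs from an existing dictionary symbol by only a few pixels is replaced by a reference to that symbol. The 5x7 glyphs, the threshold and the function names are all invented for this sketch. Real encoders use their own matching rules, so treat it as an illustration of the idea only.

```python
from typing import List

Glyph = List[str]  # one string of '0'/'1' pixels per row

# Hypothetical 5x7 bitmaps, made up for this illustration.
SIX: Glyph = ["01110", "10000", "10000", "11110", "10001", "10001", "01110"]
EIGHT: Glyph = ["01110", "10001", "10001", "01110", "10001", "10001", "01110"]


def pixel_difference(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def encode_glyph(glyph: Glyph, dictionary: List[Glyph], max_diff: int) -> int:
    """Return the index of a close-enough symbol, adding the glyph if none fits."""
    for index, symbol in enumerate(dictionary):
        if pixel_difference(glyph, symbol) <= max_diff:
            return index  # lossy reuse: the page will show `symbol` instead
    dictionary.append(glyph)
    return len(dictionary) - 1
```

With a dictionary that already holds `EIGHT`, encoding `SIX` with a tolerance of 3 pixels reuses the 8, while a tolerance of 0 stores the 6 as its own symbol.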
Extracting such an image requires 2 lines of code using 2 libraries:
poppler\bin>pdfimages -all 7535-7pt.pdf out
and a for loop, in this case covering 001-81 for the 243 outputs, similar to
jbig2\Library\bin>jbig2dec -o out-001 -t pbm out-001.jb2g out-001.jb2e
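The loop over the extracted files could be sketched in Python as below. It assumes `jbig2dec` is on the PATH, that `pdfimages -all` wrote each globals/embedded pair with the same stem (as in `out-001.jb2g` and `out-001.jb2e` above), and that adding a `.pbm` suffix to the output name is acceptable. The helper names `jbig2dec_command` and `decode_all` are made up for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def jbig2dec_command(globals_path: Path, embedded_path: Path) -> List[str]:
    """Build the jbig2dec call for one globals/embedded stream pair."""
    out_path = embedded_path.with_suffix(".pbm")  # assumption: name output after the input
    return [
        "jbig2dec", "-o", str(out_path), "-t", "pbm",
        str(globals_path), str(embedded_path),
    ]


def decode_all(folder: Path) -> int:
    """Decode every .jb2e file in `folder` that has a matching .jb2g file."""
    decoded = 0
    for embedded in sorted(folder.glob("*.jb2e")):
        globals_path = embedded.with_suffix(".jb2g")
        if not globals_path.exists():
            continue  # this sketch only handles streams that have a globals file
        subprocess.run(jbig2dec_command(globals_path, embedded), check=True)
        decoded += 1
    return decoded
```

Calling `decode_all(Path("."))` in the folder that holds the `pdfimages` output would then produce one PBM per pair.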
Metadata for the first 3 pages can be seen here (a poor 200 dpi equivalent had been used).
The 81 PBMs will typically be a faithful copy of the poor, variable inputs, and the old 243 images can be discarded (the PDF file should have been discarded anyway, and the paper source rescanned at a higher resolution), as the images are of no use except to show the errors above.