How can I further process buggy / malformed PDFs that Tika / PDFBox cannot parse but Evince / LibreOffice Draw can open?

Posted on 2025-01-28 19:31:49

My program is reading documents with Tika 2.24 to extract their contents.

Yet some PDFs (maybe buggy or malformed) cannot be processed by PDFBox, although Evince, LibreOffice Draw or even GIMP can open them.

I cannot share these PDFs, but what I can tell is that they used to trigger a StackOverflowError (as described on Jira) with PDFBox 2.0.25 and now trigger an IOException with PDFBox 2.0.26:

Caused by: java.io.IOException: Possible recursion detected when dereferencing object 29 0

Consequently, now that an IOException can be caught, it is tempting to process a malformed PDF differently once the first parsing attempt has triggered the IOException.

I read that PDFBox offers a way to handle malformed PDFs by calling setLenient(true) on a parser, but I could not find a way to set such leniency in Tika.

By the way, I tried that solution with both setLenient(true) and setLenient(false), but the IOException still appears.

Edit: following KJ's suggestion I ran pdftotext, which output the following warnings:

Syntax Error (5602): Object '29 0 obj' is being already parsed
Syntax Error (5603): Bad 'Length' attribute in stream
Syntax Error (8596): Missing 'endstream' or incorrect stream length
Syntax Error (16945): Object '35 0 obj' is being already parsed
Syntax Error (16946): Bad 'Length' attribute in stream
Syntax Error (23267): Missing 'endstream' or incorrect stream length
Syntax Error (23332): Object '37 0 obj' is being already parsed
Syntax Error (23333): Bad 'Length' attribute in stream
Syntax Error (28645): Missing 'endstream' or incorrect stream length

(Please note: there are 4 pages which seem to be malformed, as PDFSam cannot export them separately.)

Opening the PDF file in a text editor, as suggested by KJ, revealed only a single hit for "29 0 obj". Running mutool show -be mypdf.pdf 29 outputs the warning "PDF stream Length incorrect" and then the compressed content.

[QPDF check]
Still following KJ's advice, running qpdf with the --check flag yields:

checking myPDFWithIssues.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
WARNING: myPDFWithIssues.pdf (offset 5602): loop detected resolving object 29 0
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5552): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5603): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5603): recovered stream length: 2983
WARNING: myPDFWithIssues.pdf (offset 16945): loop detected resolving object 35 0
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16895): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16946): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16946): recovered stream length: 6311
WARNING: myPDFWithIssues.pdf (offset 23332): loop detected resolving object 37 0
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23282): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23333): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23333): recovered stream length: 5302
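
Since qpdf reports that it recovered the three stream lengths, one option worth testing is to let qpdf rewrite the file and hand the rewritten copy to Tika. The sketch below is only an illustration: the class and method names are made up, it assumes qpdf is on the PATH, and whether PDFBox then accepts the rewritten file has not been verified on these PDFs. It relies on qpdf's documented exit codes (0 for success, 3 for success with warnings).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: asks qpdf to rewrite a damaged PDF before parsing it.
final class QpdfRepair {

    // Plain "qpdf in.pdf out.pdf" rewrites the file, applying qpdf's recovery.
    static List<String> repairCommand(Path damaged, Path repaired) {
        return List.of("qpdf", damaged.toString(), repaired.toString());
    }

    // Exit code 0 = success, 3 = success with warnings, 2 = errors.
    static boolean outputUsable(int exitCode) {
        return exitCode == 0 || exitCode == 3;
    }

    // Returns true if qpdf finished in time and produced an output file.
    static boolean repair(Path damaged, Path repaired, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(repairCommand(damaged, repaired))
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.DISCARD) // warnings go here
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return false;
        }
        return outputUsable(process.exitValue());
    }
}
```

If repair(...) returns true, the repaired path can be passed to the usual Tika parsing code in place of the original file.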

Yet the faulty PDF has since been regenerated by another user (from the same sources), and the newer PDF does not show any warnings. So the issue will be hard to track!

So my question is: how can I use Tika / PDFBox to process malformed PDFs that trigger the aforementioned IOException related to possible recursion?

Any hint appreciated

Comments (1)

为你鎻心 2025-02-04 19:31:49

The quick and dirty way I employed was to use the external command-line tool pdftotext (from the poppler-utils package on Debian / Ubuntu), as suggested by @KJ in their (now deleted) comments.
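
A minimal sketch of what such a fallback could look like in Java is shown below. It is not the exact code used: PdfTextFallback and its Extractor interface are hypothetical names, the Extractor stands in for whatever Tika call the program already makes, and pdftotext is assumed to be on the PATH. The external tool is only invoked when the primary extraction throws an IOException, such as the "Possible recursion detected" one from PDFBox 2.0.26.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: try the normal parser first, then fall back to pdftotext.
final class PdfTextFallback {

    // Stand-in for the existing Tika / PDFBox extraction; not a Tika type.
    @FunctionalInterface
    interface Extractor {
        String extract(Path pdf) throws IOException;
    }

    // pdftotext writes to stdout when the output file name is "-".
    static List<String> pdftotextCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs a command, captures stdout through a temp file, enforces a timeout.
    static String runExternal(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdf-fallback", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    // Syntax warnings are printed on stderr; drop them here.
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            return Files.readString(out, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    // Uses the primary extractor and only shells out if it throws IOException.
    static String extract(Path pdf, Extractor primary)
            throws IOException, InterruptedException {
        try {
            return primary.extract(pdf);
        } catch (IOException e) {
            return runExternal(pdftotextCommand(pdf), 60);
        }
    }
}
```

The text from pdftotext will not carry the metadata Tika normally returns, so callers may want to record which path produced the result.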
