Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found

Posted on 2024-11-08 11:12:20

pyPdf throws this exception:

pyPdf.utils.PdfReadError: EOF marker not found

I don't need to fix pyPdf, I just need to get the EOF error to cause an "except" block to execute and skip over the file, but it doesn't work. It still causes the program to stop running.

Background:

Batch OCR Program for PDFs

Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

... the saga continues.

I've got 10,000 PDFs in a folder. Some are OCR'd, some are not. Can't tell 'em apart. Step 1 is to figure out which ones are not OCR'd and OCR only those (see other threads for details).

So I'm using pyPdf. I get some exceptions related to unrecognized characters and unsupported filters when I try to read the text. So I guesstimated that if it throws an exception, it's got some text in it, and then it doesn't go in the list. Problem solved, right? Like so:

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    # Raw string so the backslashes in the Windows path are kept literally
    path = r'C:\Users\Homer\Documents\My Pdfs'

    filelist = os.listdir(path)

    has_text_list = []
    does_not_have_text_list = []

    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
        print pdf_name
        for i in range(0, pdf.getNumPages()):
            try:
                pdf.write("%%EOF")  # attempt to write the EOF marker first
                content = pdf.getPage(i).extractText()
                # Any run of two or more word characters counts as text
                does_it_have_text = re.findall(r'\w{2,}', content)
                if does_it_have_text == []:
                    does_not_have_text_list.append(pdf_name)
                    print pdf_name
                else:
                    has_text_list.append(pdf_name)
            except:
                # Assume an extraction error means the page has some text
                has_text_list.append(pdf_name)

    print does_not_have_text_list

But then I get this error:

pyPdf.utils.PdfReadError: EOF marker not found

Seems like it comes up a lot (from google):

http://pdfposter.origo.ethz.ch/node/31

I think it means that pyPdf opened the file, made its attempt at text processing, raised whatever exception, ran the except: block, but is now unable to go on to the next step because it doesn't know that the file has ended.
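One thing worth checking here (an assumption about pyPdf's internals, not something confirmed in this thread): if `PdfFileReader(...)` parses the file as soon as it is constructed, the error would be raised on the line before the `try:`, so the `except:` block never gets a chance to run. The sketch below shows that difference in plain Python. `FakeReader` and `FakeReadError` are hypothetical stand-ins and are not part of pyPdf.

```python
class FakeReadError(Exception):
    """Stand-in for pyPdf.utils.PdfReadError (hypothetical)."""


class FakeReader(object):
    """Stand-in for a reader that parses its input in the constructor."""

    def __init__(self, data):
        if not data.rstrip().endswith(b"%%EOF"):
            raise FakeReadError("EOF marker not found")
        self.data = data


def open_outside_try(data):
    # Mirrors the posted loop: the reader is built before the try block,
    # so an error raised by the constructor propagates to the caller.
    reader = FakeReader(data)
    try:
        return len(reader.data)
    except Exception:
        return None


def open_inside_try(data):
    # Building the reader inside the try block lets except handle the error.
    try:
        reader = FakeReader(data)
        return len(reader.data)
    except FakeReadError:
        return None
```

With input that lacks the marker, `open_outside_try` raises while `open_inside_try` returns `None`, which is the "skip this file" behaviour the question is after.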

There are other threads like this and they allege that this has been fixed, but it doesn't seem to have been.

Then someone has a function here where they write the EOF marker to the .pdf first.

http://code.activestate.com/lists/python-list/589529/

I stuck in the "pdf.write("%%EOF")" line to try to mimic this, but no dice.
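For what it's worth, a file can be pre-screened for the marker without pyPdf at all. The sketch below is only an approximation: `has_eof_marker` and its 1024-byte window are hypothetical choices made for illustration, and pyPdf's own check may be stricter or looser. Note that in a plain Python string with no `%` formatting applied, `"%%EOF"` is the literal five characters `%%EOF`, which is the marker a PDF trailer ends with.

```python
import os


def has_eof_marker(path, tail_bytes=1024):
    """Return True if b'%%EOF' appears in the last tail_bytes of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        # Seek to the tail so large PDFs are not read in full.
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return b"%%EOF" in tail
```

Files that fail this check could be set aside in their own list before anything is handed to pyPdf.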

So how do I get that error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to just skip over these files, that would work too. Thx.
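A minimal sketch of the skip-on-error behaviour described above, assuming the error comes from opening the file rather than from `extractText()`. `classify_pdfs` and its `extract_pages` argument are hypothetical names. `extract_pages` stands for whatever opens one PDF and returns its page texts, and it is passed in so the control flow can be tried without any real PDFs.

```python
import re

WORD = re.compile(r"\w{2,}")  # same pattern as in the posted loop


def classify_pdfs(paths, extract_pages):
    """Sort paths into (has_text, no_text, skipped).

    extract_pages(path) is a caller-supplied function that returns a list
    of page texts, or raises if the file cannot be read.
    """
    has_text, no_text, skipped = [], [], []
    for path in paths:
        try:
            # Opening and extraction both happen inside the try block.
            pages = extract_pages(path)
        except Exception as exc:
            # Record the failure and move on to the next file.
            skipped.append((path, str(exc)))
            continue
        if any(WORD.search(text or "") for text in pages):
            has_text.append(path)
        else:
            no_text.append(path)
    return has_text, no_text, skipped
```

With pyPdf, `extract_pages` could open the file in binary mode, build a `PdfFileReader`, and return `extractText()` for each page index up to `getNumPages()`. Note that this sketch files every failure under `skipped`, which differs from the posted loop, where an exception counts as "has text", and that it classifies each file once rather than once per page.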
