I wrote a Python function that extracts text from a PDF. I need help handling this error

Posted 2025-02-09 10:58:57 · 1140 characters · 1 view · 0 comments

def extract_clean_text_lisible(path_input, path_output):

    if spawn.find_executable("pdftotext"):
        path_input = current_app.config['PROJECT_PATH']+path_input
        path_output = current_app.config['PROJECT_PATH']+path_output
        pdftotext = current_app.config['POPPLER_PATH']+"/pdftotext.exe"
        out, err = sp.Popen([pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"], stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE)
        fichier = open(path_output, "w", encoding="utf-8")
        s = out.decode("utf-8")
        fichier.write(s)
        fichier.close()
    
    else:
        raise EnvironmentError(
            "pdftotext not installed. can be downloaded from https://poppler.freedesktop.org/"
        )
    return out.decode("utf-8")

out, err = sp.Popen([pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"], stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE)
TypeError: cannot unpack non-iterable Popen object


Comments (2)

好久不见√ 2025-02-16 10:58:57

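I found a solution for this function. sp.Popen returns a process object, not an (out, err) tuple, so the output has to be read with communicate():

import subprocess as sp
from distutils import spawn  # assumed import; distutils was removed in Python 3.12
from flask import current_app  # assumed import; current_app.config suggests Flask

def extract_clean_text_lisible(path_input, path_output):

    if spawn.find_executable("pdftotext"):
        path_input = current_app.config['PROJECT_PATH']+path_input
        path_output = current_app.config['PROJECT_PATH']+path_output
        pdftotext = current_app.config['POPPLER_PATH']+"/pdftotext.exe"
        process = sp.Popen([pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"], stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE)
        # communicate() waits for pdftotext to exit and returns (stdout, stderr) as bytes
        out, err = process.communicate()
        text = out.decode('utf-8')
        # write the extracted text to the output file
        with open(path_output, 'w', encoding='utf-8') as fichier:
            fichier.write(text)

    else:
        raise EnvironmentError(
            "pdftotext not installed. can be downloaded from https://poppler.freedesktop.org/"
        )
    return text

You can use this function to extract clean text from PDFs and store it at a specified path.

Make sure Poppler is installed, since pdftotext is part of it. The function as shown does not call Tesseract, which is only needed if you also want OCR for scanned PDFs.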
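To make the cause of the TypeError concrete, here is a minimal sketch that reproduces it and shows two ways to read the output. It uses the Python interpreter itself as a stand-in child process (an assumption, so the snippet runs without Poppler installed); in the function above the command list would be the pdftotext call instead.

```python
import subprocess as sp
import sys

# Stand-in command so the sketch runs anywhere; replace it with the
# pdftotext command line from the function above in real use.
cmd = [sys.executable, "-c", "print('hello from child')"]

# Popen returns a process object immediately; it is not iterable,
# so tuple-unpacking it raises TypeError.
proc = sp.Popen(cmd, stdout=sp.PIPE, stderr=sp.PIPE)
try:
    out, err = proc
except TypeError as exc:
    unpack_error = exc
else:
    unpack_error = None

# Option 1: communicate() waits for the process to exit and returns
# the (stdout, stderr) pair as bytes.
out, err = proc.communicate()
text_popen = out.decode("utf-8")

# Option 2: subprocess.run() wraps Popen + communicate() in one call;
# check=True raises CalledProcessError on a non-zero exit code.
result = sp.run(cmd, capture_output=True, check=True)
text_run = result.stdout.decode("utf-8")

print(type(unpack_error).__name__, text_popen.strip(), text_run.strip())
```

Either option works here. subprocess.run is usually the shorter choice when you only need the final output, and check=True surfaces a pdftotext failure instead of silently writing an empty file.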

梦明 2025-02-16 10:58:57

For such a basic command on Windows you don't need Python. Simply send a shortcut of pdftotext.exe to any location; the desktop is easiest.

Then change the shortcut's properties as required. In this case, append `-layout -enc UTF-8` to the target. You can assign any icon. After that, you can drag and drop any single PDF onto the icon,

and you instantly get the text file in the same folder.

For multiple files or a folder of PDFs you would need a more complex shortcut (still possible as one line), so it is easier to write a line or two in a cmd file to pass the variables.
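Since the question is already written in Python, the multi-file case could also be handled there instead of in a cmd file. The following is only a sketch under assumptions: `build_commands` and `convert_folder` are hypothetical helper names, and pdftotext is assumed to be reachable on PATH (pass the full path to pdftotext.exe otherwise). It relies on the standard `pdftotext [options] PDF-file [text-file]` invocation.

```python
import subprocess as sp
from pathlib import Path


def build_commands(folder, pdftotext="pdftotext"):
    """Return one pdftotext command line per *.pdf file in `folder`.

    Hypothetical helper: each command writes <name>.txt next to <name>.pdf.
    """
    commands = []
    # The glob pattern is matched case-sensitively on most non-Windows systems.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        commands.append(
            [pdftotext, "-layout", "-enc", "UTF-8", str(pdf), str(txt)]
        )
    return commands


def convert_folder(folder, pdftotext="pdftotext"):
    """Run pdftotext for every PDF in `folder`; return the .txt paths written."""
    written = []
    for cmd in build_commands(folder, pdftotext):
        # check=True raises CalledProcessError if pdftotext exits with an error.
        sp.run(cmd, check=True)
        written.append(Path(cmd[-1]))
    return written
```

Calling `convert_folder("C:/docs")` (the path is illustrative) would then produce one .txt file per PDF in that folder. Building the command lists separately from running them makes it easy to inspect what will be executed first.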
