I wrote a Python function that extracts text from a PDF. I need help handling this error.
    def extract_clean_text_lisible(path_input, path_output):
        if spawn.find_executable("pdftotext"):
            path_input = current_app.config['PROJECT_PATH']+path_input
            path_output = current_app.config['PROJECT_PATH']+path_output
            pdftotext = current_app.config['POPPLER_PATH']+"/pdftotext.exe"
            out, err = sp.Popen([pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"], stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE)
            fichier = open(path_output, "w", encoding="utf-8")
            s = out.decode("utf-8")
            fichier.write(s)
            fichier.close()
        else:
            raise EnvironmentError(
                "pdftotext not installed. can be downloaded from https://poppler.freedesktop.org/"
            )
        return out.decode("utf-8")

        out, err = sp.Popen([pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"], stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE)
    TypeError: cannot unpack non-iterable Popen object
I have found a solution for this function:
    def extract_clean_text_lisible(path_input, path_output):
You can use this function to extract clean text from PDFs and store it at a specified path.
Be sure to install Poppler and Tesseract to use this function.
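The body of the function in this answer did not survive, so the exact fix is not known. The TypeError in the question arises because `sp.Popen(...)` returns a process object rather than an `(out, err)` pair; calling `.communicate()` on that object returns the pair. A minimal sketch under that assumption follows. The names `run_and_capture` and `extract_text` are hypothetical, and the Flask `current_app.config` lookups are left out so the snippet stays self-contained.

```python
import subprocess as sp
import sys


def run_and_capture(cmd):
    """Run cmd (a list of strings) and return its stdout decoded as UTF-8."""
    # Popen() returns a process object, not a tuple. communicate() waits for
    # the process to exit and returns the (stdout, stderr) pair as bytes.
    proc = sp.Popen(cmd, stdout=sp.PIPE, stderr=sp.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", errors="replace"))
    return out.decode("utf-8")


def extract_text(pdftotext, path_input, path_output):
    """Hypothetical variant of the question's function (no Flask config)."""
    # Same options as in the question; "-" sends the text to stdout.
    text = run_and_capture(
        [pdftotext, "-layout", "-enc", "UTF-8", path_input, "-"]
    )
    # The with-block closes the file even if write() raises.
    with open(path_output, "w", encoding="utf-8") as fichier:
        fichier.write(text)
    return text
```

`run_and_capture` can be tried without Poppler installed by passing any command, for example `run_and_capture([sys.executable, "-c", "print('ok')"])`.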
For such a basic command on Windows you don't need Python: simply create a shortcut to pdftotext.exe in any location (the desktop is easiest).
Then change the shortcut's properties as required, in this case adding the options `-layout -enc UTF-8` as used in the question. You can assign any icon. For any single PDF, you can then drag and drop it onto the icon,
and you instantly get the text file in the same folder.
For multiple files or a folder of PDFs you would need a more complex shortcut (still possible as one line), at which point it is easier to write a line or two in a cmd file to pass the variables.
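If Python is already in use, as in the question, the multi-file case could also be handled there instead of in a cmd file. The following is only a sketch: `convert_folder` and its `run` parameter are hypothetical names, it assumes `pdftotext` is on the PATH, and it relies on pdftotext accepting the output text file as a second positional argument.

```python
import subprocess as sp
from pathlib import Path


def convert_folder(folder, pdftotext="pdftotext", run=sp.run):
    """Convert every PDF directly inside folder; return the .txt paths."""
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Write the text next to the PDF, as the drag-and-drop shortcut does.
        txt = pdf.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        run(
            [pdftotext, "-layout", "-enc", "UTF-8", str(pdf), str(txt)],
            check=True,
        )
        written.append(txt)
    return written
```

The `run` parameter exists only so the loop can be exercised without Poppler installed; in normal use it can be left at its default.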