How do I fetch a PDF online and convert it to a string in Python?

Posted 2025-02-09 22:22:17

I am trying to get a PDF online using something like requests and convert it to a string in Python. I don't want to end up with the PDF on my hard disk. Instead, I want to fetch it online and work with it as text/a string in Python 3.

For example, say you have a PDF file with the contents: I love programming.

url = 'xyzzy.org/g.pdf'
re = requests.get(url)
# do something to re and assign it to `pdf`
convert_to_string(pdf) -> "I love programming"

Comments (2)

哎呦我呸! 2025-02-16 22:22:17

Update: read K J's answer

As pointed out in the comments, you can divide this task into two parts:

  1. Download the pdf through a stream object
  2. Convert the in-memory pdf into a string

This should do the job (it needs the PyMuPDF package):

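import io

import fitz  # provided by the PyMuPDF package
import requests

url = "http://.../sample.pdf"

# Download the PDF; the bytes stay in memory and are never written to disk.
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors such as 404
pdf = io.BytesIO(response.content)

# Open the in-memory stream and collect the text of every page.
with fitz.open(stream=pdf, filetype="pdf") as doc:
    text = "".join(page.get_text() for page in doc)
print(text)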
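
Before handing the downloaded bytes to a parser, it can help to confirm that the response really is a PDF, since a server may return an HTML error page instead. The following is only a minimal sketch of the download step using the standard library: `fetch_to_memory` and `looks_like_pdf` are hypothetical helper names, `urllib.request` stands in for requests, and the header check assumes the file starts with the usual `%PDF-` marker.

```python
import io
import urllib.request

PDF_MAGIC = b"%PDF-"  # marker that normally opens a PDF file


def fetch_to_memory(url: str, timeout: float = 30.0) -> io.BytesIO:
    """Download `url` and return the body as an in-memory stream.

    Nothing is written to disk; the bytes live only in the BytesIO object.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return io.BytesIO(resp.read())


def looks_like_pdf(stream: io.BytesIO) -> bool:
    """Return True if the stream begins with the PDF header marker."""
    return stream.getvalue().startswith(PDF_MAGIC)
```

A stream that passes `looks_like_pdf` could then be given to `fitz.open(stream=..., filetype="pdf")` as in the code above. The check is deliberately strict and may reject unusual files that carry extra bytes before the header.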
戈亓 2025-02-16 22:22:17

While the answer above correctly reads sample.pdf through an in-memory stream:

pdf = io.BytesIO(response.content)
with fitz.open(stream=pdf) as doc:

the PDF still has to be transferred in full over HTTPS (a download) and is then decoded by fitz as if it were a %tmp%MemoryBlob.pdf, that is, a file that is downloaded and then discarded after extraction.

If you want to do something similar using just the OS and Poppler, so that you can try out different options, the sequence is simply:

curl -o "%tmp%\temp.pdf" RemoteURL
pdftotext [options] "%tmp%\temp.pdf" filename.txt

This lets you rerun the last line as often as you like, and the next download simply overwrites the same %tmp% file. If you set options and change filename.txt to a single - you can review the output on the console. Beware, however, that with non-native encodings the console may show cruder output than would be stored in a file.
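
The same two-step sequence can be driven from Python, which also avoids the console encoding problem because the output bytes are decoded explicitly. This is only a sketch under assumptions: `build_pdftotext_command` and `remote_pdf_to_text` are hypothetical helper names, Poppler's `pdftotext` is assumed to be on the PATH, and, like the curl version, it does write a temporary file.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def build_pdftotext_command(pdf_path, options=(), output="-"):
    """Return the argument list for pdftotext.

    An output of "-" asks pdftotext to write the text to stdout.
    """
    return ["pdftotext", *options, str(pdf_path), output]


def remote_pdf_to_text(url, options=("-enc", "UTF-8")):
    """Download `url` to a temporary file, run pdftotext, return the text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "temp.pdf"
        with urllib.request.urlopen(url) as resp:
            pdf_path.write_bytes(resp.read())
        # Capture stdout as bytes so the decoding is under our control.
        result = subprocess.run(
            build_pdftotext_command(pdf_path, options),
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("utf-8")
```

To experiment with different options, pass them as a sequence, for example `("-layout", "-enc", "UTF-8")`. The temporary directory is removed when the function returns.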
