Reading a PDF from GridFS

Posted on 2025-01-26 05:15:12

I am trying to build a financial dashboard with Flask and pymongo. The starting point is a Flask form that saves data in a MongoDB database. One of the fields in the form is a FileField (wtforms) that allows the upload of a PDF, which is then stored in MongoDB with GridFS.
So far I can save the PDF, and I can see the resulting entries in the .files and .chunks collections. Now I would like to build a function that retrieves the PDFs and analyses them with some basic NLP, but I am struggling to get meaningful data out of them.

When I do:

storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()

The result is raw binary data. If I continue with:

with open(data, 'rb') as f:
   b = f.read()

The result is "ValueError: embedded null byte", or sometimes an empty byte string.

Any help on this?
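
The error above is consistent with how `open()` works: it expects a file system path, so handing it the bytes returned by `data.read()` makes Python treat the PDF content as a file name. A minimal sketch of the difference, using only the standard library and assuming a made-up `pdf_bytes` value as a stand-in for the bytes read from GridFS:

```python
import io

# Stand-in for the bytes returned by data.read(); not a real PDF.
pdf_bytes = b"%PDF-1.4\n\x00binary payload"

# open() interprets its first argument as a path, so binary content
# containing a null byte is rejected before any file is touched.
try:
    open(pdf_bytes, "rb")
except ValueError as exc:
    error = str(exc)

# Wrapping the bytes in BytesIO gives a file-like object without a path.
buffer = io.BytesIO(pdf_bytes)
header = buffer.read(5)
```

The object returned by `storage.get(...)` is itself file-like (it has a `read()` method), so libraries that accept a file object can usually take it directly, or take an `io.BytesIO` wrapper around the bytes.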

Comments (1)

落花浅忆 2025-02-02 05:15:12

To follow up on the above, I found a solution for myself that consists of two separate functions:

(1) When the form is submitted, and before the file is uploaded to MongoDB, I apply a function based on pdfminer that extracts the text content of the PDF and transforms it into a list of sentences using NLTK. I then store this list in the file's document in the .files collection via storage.put(file, sent_list=sent_list), where sent_list is the variable holding the list of sentences.
Whenever I want to run NLP operations on the file, I just read the sent_list field back from MongoDB.
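
Step (1) could look roughly like the sketch below. It is not the poster's actual code: the function names `pdf_to_text`, `split_sentences` and `store_pdf` are hypothetical, pdfminer.six is assumed to be installed, and a naive regex splitter stands in for NLTK's `sent_tokenize` so the example has no extra data downloads.

```python
import io
import re


def pdf_to_text(pdf_bytes):
    """Extract plain text from PDF bytes (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text  # imported lazily on purpose
    return extract_text(io.BytesIO(pdf_bytes))


def split_sentences(text):
    """Naive regex splitter; a stand-in for nltk.sent_tokenize."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def store_pdf(storage, pdf_bytes, filename, to_text=pdf_to_text):
    """Store the PDF in GridFS with its sentences as an extra field.

    `storage` is expected to be a gridfs.GridFS instance; extra keyword
    arguments to put() are saved on the file's document in .files.
    """
    sent_list = split_sentences(to_text(pdf_bytes))
    return storage.put(pdf_bytes, filename=filename, sent_list=sent_list)
```

With a real GridFS instance, the stored list should then be readable as an attribute of the retrieved file, for example `storage.get(file_id).sent_list`.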

(2) If I want to display the stored PDF in its original form, however, I use the following function as a separate route.

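# Body of the Flask route; needs `from flask import make_response` and `from gridfs import GridFS`
storage = GridFS(db, collection)
# Fetch the most recent file stored under this filename
data = storage.get_last_version(filename)
response = make_response(data.read())
# Derive the MIME subtype from the file extension, e.g. "pdf" -> application/pdf
extension = data.filename.split('.')[-1]
response.headers['Content-Type'] = f'application/{extension}'
# "inline" asks the browser to display the file instead of downloading it
response.headers['Content-Disposition'] = f'inline; filename={data.filename}'
return response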

(2) opens a new tab in my Flask app showing the .pdf file in its original format.
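
For context, the route in (2) might be wired up along these lines. This is a sketch under assumptions, not the poster's application: the `create_app` factory, the `/files/<filename>` URL and the `inline_headers` helper are hypothetical names, and `mimetypes` is swapped in for the extension-based content type so that non-PDF files get a sensible type too.

```python
import mimetypes


def inline_headers(filename):
    """Build headers that ask the browser to render the file inline."""
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return {
        "Content-Type": content_type,
        "Content-Disposition": f'inline; filename="{filename}"',
    }


def create_app(storage):
    """Hypothetical app factory; `storage` is a gridfs.GridFS instance."""
    from flask import Flask, make_response  # imported lazily on purpose

    app = Flask(__name__)

    @app.route("/files/<filename>")
    def show_file(filename):
        # Most recent version of the file stored under this name
        data = storage.get_last_version(filename)
        response = make_response(data.read())
        response.headers.update(inline_headers(data.filename))
        return response

    return app
```

Note that `get_last_version` raises an error when no file with that name exists, so a real route would probably want to catch that case and return a 404.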

I hope this helps anyone coming across a similar problem in the future.
