Reading a PDF from GridFS

Posted on 2025-01-26 05:15:12

I am trying to build a financial dashboard with Flask and pymongo. The starting point is a Flask form that saves data in a MongoDB database. One of the fields in the form is a FileField (wtforms) that allows the upload of a PDF, which is then stored in MongoDB with GridFS.
I can now save the PDF and see the resulting entries in the .files and .chunks collections. Next I would like to build a function that retrieves the PDFs and analyses them with some basic NLP, but I am struggling to get meaningful data out.

When I do:

storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()

The result is raw binary data. If I continue with:

with open(data, 'rb') as f:
   b = f.read()

The result is "ValueError: embedded null byte", or sometimes an empty byte string.

Any help on this?
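
A note on the error, offered as a hedged sketch rather than a definitive diagnosis: open() expects a filesystem path, so passing it the PDF bytes makes Python treat the content itself as a path, and the null bytes inside it raise the ValueError. The object returned by storage.get() is already file-like, and the result of its read() can be wrapped in io.BytesIO for libraries that want a file object. The bytes below are a made-up stand-in for data.read(), and as_file_like is a hypothetical helper name.

```python
import io

# Stand-in for `a = data.read()`; real PDF bytes typically contain null bytes too.
pdf_bytes = b"%PDF-1.4\x00binary stand-in"


def as_file_like(raw: bytes) -> io.BytesIO:
    """Wrap raw bytes in an in-memory file object (hypothetical helper)."""
    return io.BytesIO(raw)


try:
    # Passing the content to open() treats it as a path, which fails.
    open(pdf_bytes, "rb")
except ValueError as exc:
    print("open() failed:", exc)

# The bytes are already the file content; wrap them instead of reopening.
buffer = as_file_like(pdf_bytes)
print(buffer.read(8))
```

A PDF parser that accepts file objects can then be given the buffer directly, with no temporary file on disk.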

落花浅忆 2025-02-02 05:15:12

To follow up on the above, I found a solution for myself that consists of two separate functions:

(1) When the form is submitted, before uploading the file to MongoDB, I apply a function based on pdfminer that extracts the text content of the PDF and transforms it into a list of sentences using NLTK. I then store this list in the .files collection via storage.put(file, sent_list=sent_list), where sent_list is the variable name of the list of sentences.
Whenever I wish to run NLP operations on the file, I just read the "sent_list" field back from MongoDB.
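
Step (1) could look roughly like the sketch below. It is an illustration under assumptions, not the exact code from my app: extract_sentences and store_pdf are hypothetical names, the pdfminer.six and NLTK imports happen lazily so the extractors can be swapped out, and storage is assumed to be a GridFS instance (anything with a compatible put() works).

```python
import io


def extract_sentences(pdf_bytes, extract_text=None, split_sentences=None):
    """Return the sentences found in a PDF given as bytes (hypothetical helper)."""
    if extract_text is None:
        # Assumes pdfminer.six is installed; extract_text accepts a file object.
        from pdfminer.high_level import extract_text
    if split_sentences is None:
        # Assumes NLTK and its sentence tokenizer data are installed.
        from nltk.tokenize import sent_tokenize as split_sentences
    text = extract_text(io.BytesIO(pdf_bytes))
    # Drop empty fragments and surrounding whitespace.
    return [s.strip() for s in split_sentences(text) if s.strip()]


def store_pdf(storage, pdf_bytes, filename, **extractors):
    """Save the PDF and keep its sentence list as an extra field on the file."""
    sent_list = extract_sentences(pdf_bytes, **extractors)
    # Extra keyword arguments to put() are stored on the .files document.
    return storage.put(pdf_bytes, filename=filename, sent_list=sent_list)
```

Reading the sentences back later should then be a matter of storage.get_last_version(filename).sent_list, since GridFS exposes the extra fields as attributes of the returned file object.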

(2) If I wish to display the stored PDF in its original form, however, I include the following function as a separate route.

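from flask import make_response
from gridfs import GridFS

# Body of the Flask view function; db, collection and filename come from the app.
storage = GridFS(db, collection)
data = storage.get_last_version(filename)  # newest file stored under this name
response = make_response(data.read())  # raw PDF bytes as the response body
extension = data.filename.split('.')[-1]
response.headers['Content-Type'] = f'application/{extension}'  # e.g. application/pdf
response.headers['Content-Disposition'] = f'inline; filename={data.filename}'
return response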

(2) opens a new tab in my Flask app showing the .pdf file in its original format.
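
One caveat with the route: building the content type as application/ plus the extension happens to be right for .pdf but not for every file type. A possible variation, sketched with the standard library's mimetypes module, is shown here; inline_headers is a hypothetical helper name, and the fallback type is my own choice rather than anything GridFS or Flask requires.

```python
import mimetypes


def inline_headers(filename: str) -> dict:
    """Build response headers for showing a stored file in the browser."""
    # Look up the MIME type from the file extension; None if unknown.
    content_type, _ = mimetypes.guess_type(filename)
    return {
        # Fall back to a generic binary type for unknown extensions.
        "Content-Type": content_type or "application/octet-stream",
        # Quote the filename so names with spaces stay in one piece.
        "Content-Disposition": f'inline; filename="{filename}"',
    }


print(inline_headers("report.pdf"))
```

Each key of the returned dict can then be assigned to response.headers in place of the two header lines in the route. Filenames that themselves contain double quotes are not handled by this sketch.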

I hope this helps anyone coming across a similar problem in the future.
