Reading a PDF from GridFS
I am trying to build a financial dashboard with Flask and PyMongo. The starting point is a Flask form which saves data in a MongoDB database. One of the fields in the form is a FileField (WTForms) which allows the upload of a PDF, which is then stored in MongoDB with GridFS.
I have managed to save the PDF and can see the resulting entries in the .files and .chunks collections. Now I would like to build a function that retrieves the PDFs and analyses them with some basic NLP, but I am struggling to get meaningful data out of them.
When I do:
storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()
The result is a bytes object holding the raw binary content of the PDF. If I continue with:
with open(data, 'rb') as f:
    b = f.read()
The result is "ValueError: embedded null byte", or sometimes an empty byte string.
Any help on this?
Comments (1)
To follow up on the above, I found a solution for myself that consists of two separate functions:
(1) When the form is submitted, and before uploading the file to MongoDB, I apply a function based on pdfminer that extracts the string content of the PDF and transforms it into a list of sentences using NLTK. I then store this list in the .files document via:
storage.put(file, sent_list=sent_list)  # sent_list is the variable holding the list of sentences
Whenever I want to run NLP operations on the file, I just read the "sent_list" field back from MongoDB.
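Step (1) could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names (pdf_to_sentences, store_pdf, load_sentences) are hypothetical, pdfminer.six and NLTK (with its sentence tokenizer data downloaded) are assumed to be installed, and storage is assumed to be a gridfs.GridFS instance. Note that the PDF bytes are wrapped in io.BytesIO instead of being passed to open(), which expects a file path.

```python
import io


def pdf_to_sentences(pdf_bytes):
    """Extract the text of a PDF given as bytes and split it into sentences."""
    # Imported lazily so the other helpers work without these packages.
    from pdfminer.high_level import extract_text  # provided by pdfminer.six
    from nltk.tokenize import sent_tokenize       # needs NLTK tokenizer data

    # extract_text accepts a path or a file-like object, so wrap the bytes.
    text = extract_text(io.BytesIO(pdf_bytes))
    return sent_tokenize(text)


def store_pdf(storage, pdf_bytes, filename, to_sentences=pdf_to_sentences):
    """Store the PDF in GridFS with its sentence list as an extra field."""
    sent_list = to_sentences(pdf_bytes)
    # Extra keyword arguments to put() become fields on the .files document.
    return storage.put(pdf_bytes, filename=filename, sent_list=sent_list)


def load_sentences(storage, file_id):
    """Read the stored sentence list back without re-parsing the PDF."""
    # Custom fields on the .files document are exposed as attributes.
    return storage.get(file_id).sent_list
```

One caveat with this design: the sentence list lives inside a single .files document, so for very long PDFs it could run into MongoDB's 16 MB document limit.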
(2) If I want to display the stored PDF in its original form, however, I included the following function as a separate route.
This route opens a new tab in my Flask app showing the PDF file in its original format.
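The route's code is not included in the post. A minimal sketch of what such a route might look like is given here, assuming storage is a gridfs.GridFS instance and files were stored with the default ObjectId ids; the function names and the /pdf/<file_id> URL are hypothetical.

```python
def inline_pdf_headers(grid_out):
    """Build headers asking the browser to display the PDF, not download it."""
    filename = getattr(grid_out, "filename", None) or "document.pdf"
    return {"Content-Disposition": f'inline; filename="{filename}"'}


def register_pdf_route(app, storage):
    """Attach a hypothetical /pdf/<file_id> route to a Flask app."""
    # Imported here so the helper above can be used without Flask installed.
    from bson import ObjectId
    from bson.errors import InvalidId
    from flask import Response, abort
    from gridfs.errors import NoFile

    @app.route("/pdf/<file_id>")
    def show_pdf(file_id):
        try:
            # GridFS ids are ObjectId values by default, so the string
            # taken from the URL is converted before the lookup.
            grid_out = storage.get(ObjectId(file_id))
        except (InvalidId, NoFile):
            abort(404)
        # read() already returns the PDF as bytes; no open() call is needed.
        return Response(
            grid_out.read(),
            mimetype="application/pdf",
            headers=inline_pdf_headers(grid_out),
        )

    return show_pdf
```

Calling register_pdf_route(app, storage) once at startup would register the route. Whether the PDF appears in a new tab depends on how the page links to it, for example an anchor with target="_blank".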
I hope this helps anyone coming across a similar problem in the future.