
How do I read a DOCX file from Google Cloud Storage into a Jupyter notebook?

Posted 2025-01-22 10:44:25


锦爱 2025-01-29 10:44:25

SOLVED! I found the missing piece to my code. Turns out I was 90% of the way there.

Instead of downloading the file as a string ("blob.download_as_string"), I needed to download it and write it to my working directory ("blob.download_to_file"), using a file handle opened with the Python command open(). Once the file was saved as a .docx in my working directory, I was able to read it with Python's docx package (as shared by John Hanley).

Here's the missing code needed:

import os
import docx

# Build the destination path inside the current working directory.
filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')

# Download the blob into that file; the with block closes the file on exit.
with open(filepath, 'wb') as f:
    blob.download_to_file(f)

# Read the downloaded file back in using the docx package.
doc = docx.Document(filepath)

So, altogether, the final script should look like this:

from google.cloud import storage
import os
import docx

client = storage.Client()
bucket_name = "sample_bucket"
file_name = "Folder/sample_msword_document.docx"
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_name)  # returns None if the object does not exist

# Build the destination path inside the current working directory.
filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')

# Download the blob to disk; the with block closes the file on exit.
with open(filepath, 'wb') as f:
    blob.download_to_file(f)

# Parse the downloaded file with the docx package.
doc = docx.Document(filepath)

The variable doc is the specific result I was looking for when I asked my question. Hence, my question is fully answered.

I would still welcome answers that show more efficient ways to accomplish this. I plan to process docx files in bulk, so efficiency is a very important factor.
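
On the efficiency point, one option that may be worth trying is to skip the round trip through the working directory. docx.Document accepts a file-like object as well as a path, and a blob can be downloaded as bytes with blob.download_as_bytes(), so the bytes can be wrapped in an io.BytesIO buffer and parsed directly. The following is only a sketch under those assumptions: the helper names iter_docx_buffers and load_all_docx are hypothetical, the bucket and prefix are the sample values from the script above, and I have not measured whether this is faster for any particular workload.

```python
import io


def iter_docx_buffers(blobs):
    """Yield (blob name, BytesIO buffer) for every blob whose name ends in .docx.

    `blobs` is any iterable of objects exposing `.name` and
    `.download_as_bytes()`, such as google.cloud.storage Blob objects.
    """
    for blob in blobs:
        if not blob.name.lower().endswith(".docx"):
            continue  # skip folder placeholders and other file types
        # The whole file is held in memory; nothing is written to disk.
        yield blob.name, io.BytesIO(blob.download_as_bytes())


def load_all_docx(bucket_name, prefix=""):
    """Return {blob name: docx Document} for the .docx files under `prefix`."""
    # Imported here so iter_docx_buffers can be used without these packages.
    import docx
    from google.cloud import storage

    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return {name: docx.Document(buf) for name, buf in iter_docx_buffers(blobs)}
```

Calling load_all_docx("sample_bucket", prefix="Folder/") would then give one Document per .docx object under that prefix. Each file is held fully in memory while it is parsed, so for very large documents the download-to-disk approach above may still be the safer choice.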
