How do I read a DOCX file from Google Cloud Storage into a Jupyter notebook?

Posted on 2025-01-22 10:44:25

锦爱 2025-01-29 10:44:25

SOLVED! I found the missing piece to my code. Turns out I was 90% of the way there.

Instead of downloading the file as a string ("blob.download_as_string"), I needed to download it into my working directory with "blob.download_to_file", writing into a file opened with Python's open(). Once the .docx file was saved there, I was able to read it with Python's docx package (as shared by John Hanley).

Here's the missing code needed:

import os
import docx

filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')  # target path in the working directory

with open(filepath, 'wb') as f:
    blob.download_to_file(f)  # write the blob's bytes into the local file
# no f.close() needed: the with block closes the file on exit

doc = docx.Document(filepath)  # read the downloaded file back in using the docx package

So, altogether, the final script should look like this:

import os

import docx
from google.cloud import storage

client = storage.Client()
bucket_name = "sample_bucket"
file_name = "Folder/sample_msword_document.docx"
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_name)  # returns None if the object does not exist

# Local copy goes into the notebook's working directory.
filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')

with open(filepath, 'wb') as f:
    blob.download_to_file(f)  # the with block closes the file afterwards

doc = docx.Document(filepath)

The variable doc is the specific result I was looking for when I asked my question. Hence, my question is fully answered.

I do invite further responses related to more efficient ways to accomplish the task. I plan to process docx files in bulk, so efficiency is a very important factor.
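
On the efficiency question, one possible approach is to skip the local file entirely. docx.Document() also accepts a file-like object, so the blob's bytes can be wrapped in an in-memory buffer and parsed directly. The sketch below assumes a google-cloud-storage version that provides Blob.download_as_bytes() and uses Client.list_blobs() to walk a folder. The helper names (blob_to_buffer, is_docx, iter_documents) are made up for illustration, and the bucket and folder names are the placeholders from the script above. It has not been benchmarked, so treat it as a starting point rather than a measured improvement.

```python
import io


def blob_to_buffer(blob):
    """Download a blob's contents into an in-memory, seekable buffer.

    `blob` only needs a download_as_bytes() method, which
    google.cloud.storage.Blob provides in recent releases.
    """
    return io.BytesIO(blob.download_as_bytes())


def is_docx(blob_name):
    """Return True for .docx object names, skipping Word's '~$' lock files."""
    base = blob_name.rsplit("/", 1)[-1]
    return base.lower().endswith(".docx") and not base.startswith("~$")


def iter_documents(bucket_name, prefix=""):
    """Yield (blob name, docx Document) for each .docx object under `prefix`."""
    # Imported inside the function so the two helpers above can be used
    # (and tested) without these packages installed.
    import docx
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if is_docx(blob.name):
            # docx.Document accepts a file-like object, so nothing is
            # written to disk.
            yield blob.name, docx.Document(blob_to_buffer(blob))


# Example call (needs credentials and a real bucket):
# for name, doc in iter_documents("sample_bucket", prefix="Folder/"):
#     print(name, len(doc.paragraphs))
```

Because iter_documents is a generator, only one document is held in memory at a time. For very large batches, the downloads could probably be parallelized as well, but that is outside the scope of this sketch.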
