SOLVED! I found the missing piece of my code. Turns out I was 90% of the way there.
Instead of downloading the file as a string ("blob.download_as_string"), I needed to download it and write it to my working directory ("blob.download_to_file"), using a file handle opened in binary write mode with Python's open(). Once the docx file was saved in my working directory, I was able to read it with Python's docx package (as shared by John Hanley).
Here's the missing code:
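import os
import docx  # provided by the python-docx package
# Build the destination path inside the current working directory
filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')
# Download the blob to that path; the with block closes the file on exit,
# so no explicit f.close() is needed
with open(filepath, 'wb') as f:
    blob.download_to_file(f)
# Read the downloaded file back in using the docx package
doc = docx.Document(filepath)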
So, altogether, the final script should look like this:
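import os
import docx  # provided by the python-docx package
import pandas as pd  # not used in this snippet
from google.cloud import storage
client = storage.Client()
bucket_name = "sample_bucket"
file_name = "Folder/sample_msword_document.docx"
# Look up the bucket and the docx object inside it
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_name)  # returns None if the object does not exist
# Download the blob into the current working directory
filepath = os.path.join(os.getcwd(), 'sample_msword_document.docx')
with open(filepath, 'wb') as f:
    blob.download_to_file(f)  # the with block closes f on exit
# Read the downloaded file back in using the docx package
doc = docx.Document(filepath)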
The variable doc is the specific result I was looking for when I asked my question. Hence, my question is fully answered.
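For anyone wondering what can be done with doc from here, a minimal sketch follows. It assumes doc is a python-docx Document, whose paragraphs attribute holds the body paragraphs; the helper name paragraph_texts is hypothetical and not part of my original script.

```python
def paragraph_texts(doc):
    """Return the text of each non-empty body paragraph of a python-docx Document."""
    # doc.paragraphs is a list of Paragraph objects; .text is the plain text of each
    return [p.text for p in doc.paragraphs if p.text.strip()]
```

Note that text inside tables is not part of doc.paragraphs, so this only covers the body paragraphs.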
I welcome further answers on more efficient ways to accomplish this. I plan to process docx files in bulk, so efficiency is a very important factor.
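One direction that might help with bulk processing, sketched here under assumptions rather than tested against a real bucket: blob.download_to_file accepts any writable file object, and docx.Document accepts a file-like object as well as a path, so the round-trip through the working directory can probably be skipped by downloading into an io.BytesIO buffer. The helper names blob_to_buffer and iter_docx_documents are hypothetical, and I have not measured whether this is actually faster.

```python
import io


def blob_to_buffer(blob):
    """Download a blob into an in-memory buffer instead of a file on disk."""
    buffer = io.BytesIO()
    blob.download_to_file(buffer)  # same call as in the script, writing to memory
    buffer.seek(0)  # rewind so the reader starts at the first byte
    return buffer


def iter_docx_documents(client, bucket_name, prefix=""):
    """Yield (blob name, Document) for every .docx object under prefix.

    Hypothetical helper: client is assumed to be a google.cloud.storage.Client.
    """
    import docx  # python-docx; imported here so blob_to_buffer needs no extra package

    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.lower().endswith(".docx"):
            yield blob.name, docx.Document(blob_to_buffer(blob))
```

With the names from the script above, usage would look like: for name, doc in iter_docx_documents(client, "sample_bucket", prefix="Folder/"). Each document is held in memory only while it is being processed, and nothing is left behind in the working directory.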