Downloading specific rows of data from a website using Python and Google Cloud
I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.
Currently, I have this code.
import os
import wget
from google.cloud import storage
url = os.environ['URL']
bucket_name = os.environ['BUCKET'] #without gs://
file_name = os.environ['FILE_NAME']
cf_path = '/tmp/{}'.format(file_name)
def import_file(event, context):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket(bucket_name)
    # download the file to the Cloud Function's tmp directory
    wget.download(url, cf_path)
    # set Blob
    blob = storage.Blob(file_name, bucket)
    # upload the file to GCS
    blob.upload_from_filename(cf_path)
    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))
While this code works beautifully, the COVID-19 data is updated daily with the addition of new dates (so if I access the file on 3/7, it includes data up to 3/6). Instead of re-writing the whole file each time, I would like to extract only the newly added rows into Google Storage every time the function runs on its schedule, rather than overwriting the file that was already saved.
I'm fairly weak in programming and would appreciate the help.
While the file is in CSV format, there's also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.
I can figure out the portion on storing it to Cloud Storage, but I need help with the code that extracts only the most recently updated rows/data, more specifically.
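To illustrate the kind of thing I'm after, here is a rough, untested sketch of how I imagine the incremental update could work. The pandas dependency, the assumption that the CSV's date column is named date, and the filter-by-last-date logic are all guesses on my part, not something I have working:

import io
import os

import pandas as pd
from google.cloud import storage

url = os.environ['URL']
bucket_name = os.environ['BUCKET']  # without gs://
file_name = os.environ['FILE_NAME']

def append_new_rows(event, context):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = storage.Blob(file_name, bucket)

    # read the fresh copy of the CSV straight from the URL
    new_df = pd.read_csv(url, parse_dates=['date'])  # assumes a 'date' column

    if blob.exists(client):
        # load the copy already in GCS and find the latest date it contains
        old_df = pd.read_csv(io.BytesIO(blob.download_as_bytes()), parse_dates=['date'])
        last_date = old_df['date'].max()
        # keep only rows newer than what is already stored, then append them
        new_rows = new_df[new_df['date'] > last_date]
        combined = pd.concat([old_df, new_rows], ignore_index=True)
    else:
        # first run: nothing stored yet, keep the whole file
        combined = new_df

    blob.upload_from_string(combined.to_csv(index=False), content_type='text/csv')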
1 Answer
The usual best practice is to load the data into BigQuery every day and to partition the table by ingestion date.
Then you can run a query (or create a view) that selects only the most recent data of each type, using an OVER (PARTITION BY ...) window function to deduplicate.
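For example, the daily load and the deduplication query could look roughly like this. The project, dataset, table, and bucket names are placeholders, the iso_code/date columns assume the OWID CSV layout, and the file is assumed to already be copied into a gs:// bucket, as your current function does:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.covid.owid_raw'  # placeholder project.dataset.table

# 1. Append today's copy of the CSV into an ingestion-time partitioned table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,  # partition on ingestion time
    ),
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/owid-covid-data.csv',  # placeholder GCS path
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# 2. Query (or create a view over) only the latest version of each row,
#    deduplicating with a window function partitioned per country and date
query = """
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY iso_code, date
      ORDER BY _PARTITIONTIME DESC
    ) AS rn
  FROM `my-project.covid.owid_raw`
)
WHERE rn = 1
"""
latest_rows = client.query(query).result()

Each daily load lands in its own ingestion-date partition, so nothing is overwritten, and the view always exposes the freshest row per country and date.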