Downloading specific rows of data from a website using Python and Google Cloud
I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.
Currently, I have this code.
import os
import wget
from google.cloud import storage
url = os.environ['URL']
bucket_name = os.environ['BUCKET'] #without gs://
file_name = os.environ['FILE_NAME']
cf_path = '/tmp/{}'.format(file_name)
def import_file(event, context):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket(bucket_name)
    # download the file to the Cloud Function's tmp directory
    wget.download(url, cf_path)
    # set Blob
    blob = storage.Blob(file_name, bucket)
    # upload the file to GCS
    blob.upload_from_filename(cf_path)
    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))
While this code works beautifully, the COVID-19 data is updated daily with the addition of new dates (so if I access the file on 3/7, it includes data up to 3/6). Instead of re-writing the whole file each time, I would like to extract only the newly added rows into Google Storage every time the function runs on its schedule, rather than overwriting the file that was already saved.
I'm fairly weak in programming and would appreciate the help.
While the file is in CSV format, there's also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.
I can figure out the portion on storing it to Cloud Storage, but I need help with the code that extracts only the most recently updated rows/data, more specifically.
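To illustrate the kind of thing I'm after, here is a rough, untested sketch of how I imagine the incremental update could work. The pandas dependency, the assumption that the CSV's date column is named date, and the filter-by-last-date logic are all guesses on my part, not something I have working:

import io
import os

import pandas as pd
from google.cloud import storage

url = os.environ['URL']
bucket_name = os.environ['BUCKET']  # without gs://
file_name = os.environ['FILE_NAME']

def append_new_rows(event, context):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = storage.Blob(file_name, bucket)

    # read the fresh copy of the CSV straight from the URL
    new_df = pd.read_csv(url, parse_dates=['date'])  # assumes a 'date' column

    if blob.exists(client):
        # load the copy already in GCS and find the latest date it contains
        old_df = pd.read_csv(io.BytesIO(blob.download_as_bytes()), parse_dates=['date'])
        last_date = old_df['date'].max()
        # keep only rows newer than what is already stored, then append them
        new_rows = new_df[new_df['date'] > last_date]
        combined = pd.concat([old_df, new_rows], ignore_index=True)
    else:
        # first run: nothing stored yet, keep the whole file
        combined = new_df

    blob.upload_from_string(combined.to_csv(index=False), content_type='text/csv')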
1 Answer
The usual best practice is to load the data into BigQuery every day and to partition the table by ingestion date.
Then you can run a query (or create a view) that selects only the most recent data of each type, using an OVER (PARTITION BY ...) window function to deduplicate.
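For example, the daily load and the deduplication query could look roughly like this. The project, dataset, table, and bucket names are placeholders, the iso_code/date columns assume the OWID CSV layout, and the file is assumed to already be copied into a gs:// bucket, as your current function does:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.covid.owid_raw'  # placeholder project.dataset.table

# 1. Append today's copy of the CSV into an ingestion-time partitioned table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,  # partition on ingestion time
    ),
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/owid-covid-data.csv',  # placeholder GCS path
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# 2. Query (or create a view over) only the latest version of each row,
#    deduplicating with a window function partitioned per country and date
query = """
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY iso_code, date
      ORDER BY _PARTITIONTIME DESC
    ) AS rn
  FROM `my-project.covid.owid_raw`
)
WHERE rn = 1
"""
latest_rows = client.query(query).result()

Each daily load lands in its own ingestion-date partition, so nothing is overwritten, and the view always exposes the freshest row per country and date.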