Download specific data rows from a website using Python and Google Cloud



I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.

Currently, I have this code.

import os
import wget

from google.cloud import storage

url = os.environ['URL']
bucket_name = os.environ['BUCKET'] #without gs://
file_name = os.environ['FILE_NAME']

cf_path = '/tmp/{}'.format(file_name)

def import_file(event, context):

    # set storage client
    client = storage.Client()

    # get bucket
    bucket = client.get_bucket(bucket_name)

    # download the file to Cloud Function's tmp directory
    wget.download(url, cf_path)

    # set Blob
    blob = storage.Blob(file_name, bucket)
 
    # upload the file to GCS
    blob.upload_from_filename(cf_path)

    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))

While this code works beautifully, the Covid-19 data is updated daily with a new date added (so if I access the file on 3/7, it includes data through 3/6). Instead of re-writing the whole file each time the function runs on its schedule, I wish to extract only the newly updated rows into Google Storage rather than overwriting the file that was already saved.
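
A minimal sketch of one way to do this, assuming pandas is available in the function's environment and that the OWID file keeps its published location and date columns; the bucket and object names below are placeholders, not part of your setup:

import io

import pandas as pd
from google.cloud import storage

URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
BUCKET_NAME = "my-bucket"          # placeholder bucket name
FULL_COPY = "owid-covid-data.csv"  # full copy kept from the previous run

def append_new_rows(event, context):
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    full_blob = bucket.blob(FULL_COPY)

    # pandas reads the CSV straight from the URL
    new_df = pd.read_csv(URL)

    if full_blob.exists():
        # load the copy saved on the previous run
        old_df = pd.read_csv(io.BytesIO(full_blob.download_as_bytes()))
        # a (location, date) pair identifies a row in this dataset
        seen = set(zip(old_df["location"], old_df["date"]))
        is_new = [pair not in seen
                  for pair in zip(new_df["location"], new_df["date"])]
        delta = new_df[is_new]
    else:
        delta = new_df  # first run: every row is new

    # store only the new rows as a separate, timestamped object...
    if not delta.empty:
        bucket.blob("deltas/{}.csv".format(context.timestamp)).upload_from_string(
            delta.to_csv(index=False), content_type="text/csv")

    # ...and refresh the full copy so the next run can diff against it
    full_blob.upload_from_string(new_df.to_csv(index=False),
                                 content_type="text/csv")

Note that keying on (location, date) picks up genuinely new rows only; if you also want to catch revisions to already-published dates, diff on whole rows instead (for example via a pandas merge with indicator=True).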

I am fairly weak in programming and would appreciate the help.

While the file is in CSV format, there's also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.

I can figure out the part about storing it in Cloud Storage, but I need help with the code to extract the most recently updated rows/data specifically.
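
If "most updated rows" just means the rows carrying the newest date in the file, a shorter sketch (again assuming the OWID date column) is to filter on the file-wide maximum date:

import pandas as pd

URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"

def latest_rows():
    df = pd.read_csv(URL, parse_dates=["date"])
    # one row per location per day; the newest rows are those
    # whose date equals the file-wide maximum
    return df[df["date"] == df["date"].max()]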


活雷疯 2025-02-19 16:29:38


The usual best practice is to load the data into BigQuery every day and to partition the table by ingestion date.

Then, you can run a query (or create a view) that selects only the most recent data per key, deduplicating with the OVER (PARTITION BY ...) window syntax.
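
A sketch of what that could look like with the google-cloud-bigquery client; the dataset, table, and bucket names are placeholders, and the schema is autodetected rather than taken from the question:

from google.cloud import bigquery

client = bigquery.Client()

# append today's file from GCS into a table partitioned by ingestion time
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(
    "gs://my-bucket/owid-covid-data.csv",  # placeholder bucket
    "my_dataset.covid_raw",                # placeholder table
    job_config=job_config,
).result()

# deduplicate: keep the most recently ingested copy of each row
query = """
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY iso_code, date
           ORDER BY _PARTITIONTIME DESC) AS rn
  FROM `my_dataset.covid_raw`
)
WHERE rn = 1
"""
rows = client.query(query).result()

Because each daily load appends the full file, the ROW_NUMBER() filter over _PARTITIONTIME keeps only the newest copy of every (iso_code, date) row, which is also where any revised values end up.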
