Reading files from Vertex AI and Google Cloud Storage

Published 2025-01-11 16:13:23


I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is being written using Kubeflow Pipelines and has many different components; one thing in particular is giving me trouble, however. Eventually I want to launch this from a Cloud Function with the help of Cloud Scheduler.

The part that is giving me issues is fairly simple, and I believe I just need some form of introduction to how I should be thinking about this setup. I simply want to read and write files (which might be .csv, .txt, or similar). I imagine that the analog in GCP to the filesystem on my local machine is Cloud Storage, so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this:

@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO

    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)

    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')

    blob = StringIO(blob)  # transform bytes to string here

    names = csv.reader(blob)  # then use csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")

The error I'm getting looks like the following:

google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found

What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files. I must be missing something fundamental? Any help is highly appreciated.


Comments (1)

路还长,别太狂 2025-01-18 16:13:23


Try specifying the bucket name without the gs:// prefix. This should fix the issue. One more Stack Overflow post says the same thing: Cloud Storage python client fails to retrieve bucket

Any storage bucket you try to access in GCP has a unique address, and that address always starts with gs://, which marks it as a Cloud Storage URL. The GCS APIs, however, are designed to work with just the bucket name, so you pass only the name itself. (You can see the mix-up in the 404 above: the client inserted the literal string gs://pipeline_dev into the REST URL as the bucket name, and no bucket by that name exists.) If you were accessing the bucket via a browser you would need the complete address, and hence the gs:// prefix as well.
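Putting that together, here is a minimal sketch of a corrected component. It assumes a bucket literally named my_bucket and the KFP v2 SDK (adjust both to your setup). One further assumption worth flagging: the pip package that actually provides google.cloud.storage is google-cloud-storage; the bare google-cloud package listed in the original component does not include it.

from kfp.dsl import component  # KFP v2 SDK; older releases use kfp.v2.dsl

@component(
    # google-cloud-storage ships the Storage client; "google-cloud" does not
    packages_to_install=["google-cloud-storage"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO

    from google.cloud import storage

    BUCKET_NAME = "my_bucket"  # bucket name only, no gs:// prefix

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)

    # download the object as bytes and decode to text
    blob = bucket.blob('test/test.txt')
    content = blob.download_as_string().decode('utf-8')

    # wrap the text in a file-like object so csv.reader can consume it
    for name in csv.reader(StringIO(content)):
        print(f"First Name: {name[0]}")

Writing is symmetric: something like bucket.blob('test/out.csv').upload_from_string(data) covers the "write" half of the question.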
