Vertex Workbench - How to run BigQueryExampleGen in a Jupyter notebook

Published 2025-02-07 03:13:54


Problem

Tried to run BigQueryExampleGen in a Jupyter notebook on Vertex Workbench and got the following error:

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Steps

BigQueryExampleGen
Set up the GCP project and the interactive TFX context.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path_to_credential_file"


from tfx.v1.extensions.google_cloud_big_query import BigQueryExampleGen
from tfx.v1.components import (
    StatisticsGen,
    SchemaGen,
)
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip
context = InteractiveContext(pipeline_root='./data/artifacts')
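Before running the component, it can help to sanity-check that the credentials set above are actually reachable. A minimal sketch (not part of the original notebook; the path is whatever you exported in `GOOGLE_APPLICATION_CREDENTIALS`):

```python
import os

def check_adc(env) -> str:
    """Return a human-readable status of the Application Default
    Credentials setup. `env` is a mapping like os.environ; passing
    it in explicitly keeps the check testable."""
    cred_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not cred_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.exists(cred_path):
        return f"credential file not found: {cred_path}"
    return "credentials configured"

print(check_adc(os.environ))
```

If this prints anything other than "credentials configured", fix the key path before debugging the pipeline itself.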

Run the BigQueryExampleGen.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query)
)

Got the error.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Data

See mlops-with-vertex-ai/01-dataset-management.ipynb to set up the BigQuery dataset for the Chicago Taxi Trips dataset.


Comments (1)

Answered by 妄想挽回 on 2025-02-14 03:13:54


Project ID

To run in GCP, you need to provide the project ID via the beam_pipeline_args argument.

#888 has been proposed to make this work. With that change, you would be able to do:

context.run(..., beam_pipeline_args=['--project', 'my-project'])

Applying that here:
query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
    ]
)

However, it still fails with another error.

ValueError: ReadFromBigQuery requires a GCS location to be provided. Neither gcs_location in the constructor nor the fallback option --temp_location is set. [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

GCS Bucket

It looks like, inside GCP, the interactive context runs the BigQueryExampleGen via Dataflow, so a GCS bucket URL needs to be provided via the beam_pipeline_args argument.

When running your Dataflow pipeline, pass the argument --temp_location gs://bucket/subfolder/.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
        '--temp_location', BUCKET  # must be a gs:// URI, e.g. 'gs://my-bucket/tmp'
    ]
)
statistics_gen = context.run(
    StatisticsGen(examples=example_gen.component.outputs['examples'])
)
context.show(statistics_gen.component.outputs['statistics'])

schema_gen = SchemaGen(
    statistics=statistics_gen.component.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])


Documentation

This notebook-based tutorial will use Google Cloud BigQuery as a data source to train an ML model. The ML pipeline will be constructed using TFX and run on Google Cloud Vertex Pipelines. In this tutorial, we will use the BigQueryExampleGen component which reads data from BigQuery to TFX pipelines.

We also need to pass beam_pipeline_args for the BigQueryExampleGen. It includes configs like the name of the GCP project and the temporary storage for the BigQuery execution.
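Putting the two fixes together, the required flags can be assembled in one place. A small sketch (the helper name, project, and bucket are hypothetical, not from TFX):

```python
def build_beam_args(project_id: str, temp_bucket: str) -> list:
    """Assemble the Beam flags BigQueryExampleGen needs.

    ReadFromBigQuery stages query results in GCS, so temp_location
    must be a gs:// URI rather than a local path.
    """
    if not temp_bucket.startswith("gs://"):
        raise ValueError("temp_location must be a gs:// URI")
    return ["--project", project_id, "--temp_location", temp_bucket]

# Hypothetical values; substitute your own project and bucket.
print(build_beam_args("my-gcp-project", "gs://my-bucket/tmp"))
```

Validating the bucket URI up front surfaces the second error (`ReadFromBigQuery requires a GCS location`) before the pipeline is submitted instead of mid-run.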
