Vertex Workbench - How to run BigQueryExampleGen in a Jupyter notebook

Published 2025-02-07 03:13:54


Problem

Tried to run BigQueryExampleGen in a Jupyter notebook on Vertex Workbench and got the following error:

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Steps

BigQueryExampleGen
Set up the GCP project and the interactive TFX context.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path_to_credential_file"


from tfx.v1.extensions.google_cloud_big_query import BigQueryExampleGen
from tfx.v1.components import (
    StatisticsGen,
    SchemaGen,
)
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip
context = InteractiveContext(pipeline_root='./data/artifacts')
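Before running the component, it can help to sanity-check that the credentials set above are actually reachable. A minimal sketch (not part of the original notebook; the path is whatever you exported in `GOOGLE_APPLICATION_CREDENTIALS`):

```python
import os

def check_adc(env) -> str:
    """Return a human-readable status of the Application Default
    Credentials setup. `env` is a mapping like os.environ; passing
    it in explicitly keeps the check testable."""
    cred_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not cred_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.exists(cred_path):
        return f"credential file not found: {cred_path}"
    return "credentials configured"

print(check_adc(os.environ))
```

If this prints anything other than "credentials configured", fix the key path before debugging the pipeline itself.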

Run the BigQueryExampleGen.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query)
)

Got the error.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Data

See mlops-with-vertex-ai/01-dataset-management.ipynb to set up the BigQuery dataset for the Chicago Taxi Trips dataset.


Comments (1)

Answered by 妄想挽回 on 2025-02-14 03:13:54


Project ID

To run in GCP, you need to provide the project ID via the beam_pipeline_args argument.

#888 has been proposed to make this work. With that change, you would be able to do:

context.run(..., beam_pipeline_args=['--project', 'my-project'])

Applying that here:
query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
    ]
)

However, it still fails with another error.

ValueError: ReadFromBigQuery requires a GCS location to be provided. Neither gcs_location in the constructor nor the fallback option --temp_location is set. [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

GCS Bucket

It looks like, inside GCP, the interactive context runs the BigQueryExampleGen via Dataflow, so a GCS bucket URL needs to be provided via the beam_pipeline_args argument.

When running your Dataflow pipeline, pass the argument --temp_location gs://bucket/subfolder/.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
        '--temp_location', BUCKET  # must be a gs:// URI, e.g. 'gs://my-bucket/tmp'
    ]
)
statistics_gen = context.run(
    StatisticsGen(examples=example_gen.component.outputs['examples'])
)
context.show(statistics_gen.component.outputs['statistics'])

schema_gen = SchemaGen(
    statistics=statistics_gen.component.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])


Documentation

This notebook-based tutorial will use Google Cloud BigQuery as a data source to train an ML model. The ML pipeline will be constructed using TFX and run on Google Cloud Vertex Pipelines. In this tutorial, we will use the BigQueryExampleGen component which reads data from BigQuery to TFX pipelines.

We also need to pass beam_pipeline_args for the BigQueryExampleGen. It includes configs like the name of the GCP project and the temporary storage for the BigQuery execution.
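Putting the two fixes together, the required flags can be assembled in one place. A small sketch (the helper name, project, and bucket are hypothetical, not from TFX):

```python
def build_beam_args(project_id: str, temp_bucket: str) -> list:
    """Assemble the Beam flags BigQueryExampleGen needs.

    ReadFromBigQuery stages query results in GCS, so temp_location
    must be a gs:// URI rather than a local path.
    """
    if not temp_bucket.startswith("gs://"):
        raise ValueError("temp_location must be a gs:// URI")
    return ["--project", project_id, "--temp_location", temp_bucket]

# Hypothetical values; substitute your own project and bucket.
print(build_beam_args("my-gcp-project", "gs://my-bucket/tmp"))
```

Validating the bucket URI up front surfaces the second error (`ReadFromBigQuery requires a GCS location`) before the pipeline is submitted instead of mid-run.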
