Airflow Dataproc Serverless create batch operator not using Python arguments
I'm trying to set up a Dataproc Serverless batch job from Google Cloud Composer using the DataprocCreateBatchOperator operator, which takes some arguments that affect the underlying Python code. However, I'm running into the following error:
error: unrecognized arguments: --run_timestamp "2022-06-17T13:22:51.800834+00:00" --temp_bucket "gs://pipeline/spark_temp_bucket/hourly/" --bucket "pipeline" --pipeline "hourly"
This is how my operator is set up:
import random
import string

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
    task_id="hourly_pipeline",
    project_id="dev",
    region="us-west1",
    # random 40-character batch id
    batch_id="".join(random.choice(string.ascii_lowercase + string.digits + "-") for i in range(40)),
    batch={
        "environment_config": {
            "execution_config": {
                "service_account": "<service_account>",
                "subnetwork_uri": "<uri>",
            }
        },
        "pyspark_batch": {
            "main_python_file_uri": "gs://pipeline/code/pipeline_feat_creation.py",
            "args": [
                '--run_timestamp "{{ ts }}"',
                '--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"',
                '--bucket "pipeline"',
                '--pipeline "hourly"',
            ],
            "jar_file_uris": [
                "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.0.jar"
            ],
        },
    },
)
Regarding the args array: I tried setting the parameters both with and without encapsulating them in "". I've also already done a gcloud submit that worked, like so:
gcloud dataproc batches submit pyspark "gs://pipeline/code/pipeline_feat_creation.py" \
--batch=jskdnkajsnd-test-10 --region=us-west1 --subnet="<uri>" \
-- --run_timestamp "2020-01-01" --temp_bucket gs://pipeline/spark_temp_bucket/hourly/ --bucket pipeline --pipeline hourly
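As a side note on what the error means: "error: unrecognized arguments" is the message Python's argparse prints when leftover argv tokens match no defined option, so the receiving script presumably parses these flags roughly as in the sketch below (hypothetical; the actual pipeline_feat_creation.py is not shown here):

    import argparse

    # Hypothetical parser for pipeline_feat_creation.py (the real script is not shown above).
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_timestamp")
    parser.add_argument("--temp_bucket")
    parser.add_argument("--bucket")
    parser.add_argument("--pipeline")
    args = parser.parse_args()

Each element of the operator's args list reaches the script as a single argv token, so '--run_timestamp "{{ ts }}"' arrives as one literal string, quotes and all, which matches no option name and ends up in the "unrecognized arguments" list. The gcloud invocation works because the shell splits --run_timestamp "2020-01-01" into two separate tokens before they reach the script.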
Comments (1)
The error I was running into was that I wasn't adding a = after each parameter; I've also eliminated the " encapsulation around each parameter. This is how the args are now set up:
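The corrected operator code was not included in the answer, but based on that description the args array would presumably now look something like this (a sketch; values taken from the question above):

    "args": [
        "--run_timestamp={{ ts }}",
        "--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/",
        "--bucket=pipeline",
        "--pipeline=hourly"
    ]

With the key=value form, the flag and its value stay in one argv token that argparse splits on the =, so no shell-style quoting is needed.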