Airflow DataprocCreateBatchOperator for Dataproc Serverless does not pass Python arguments

Posted on 2025-02-08 20:43:16

I'm trying to set up a Dataproc Serverless batch job from Google Cloud Composer using the DataprocCreateBatchOperator operator, which takes some arguments that affect the underlying Python code. However, I'm running into the following error:

error: unrecognized arguments: --run_timestamp "2022-06-17T13:22:51.800834+00:00" --temp_bucket "gs://pipeline/spark_temp_bucket/hourly/" --bucket "pipeline" --pipeline "hourly"

This is how my operator is set up:

import random
import string

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
        task_id="hourly_pipeline",
        project_id="dev",
        region="us-west1",
        batch_id="".join(random.choice(string.ascii_lowercase + string.digits + "-") for i in range(40)),
        batch={
            "environment_config": {
                "execution_config": {
                    "service_account": "<service_account>",
                    "subnetwork_uri": "<uri>
                }
            },
            "pyspark_batch": {
                "main_python_file_uri": "gs://pipeline/code/pipeline_feat_creation.py",
                "args": [
                    '--run_timestamp "{{ ts }}"',
                    '--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"',
                    '--bucket "pipeline"',
                    '--pipeline "hourly"'
                ],
                "jar_file_uris": [
                    "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.0.jar"
                ],
            }
        }
    )

Regarding the args array: I tried setting the parameters both with and without encapsulating them in "". I've also already done a gcloud submit that worked, like so:

gcloud dataproc batches submit pyspark "gs://pipeline/code/pipeline_feat_creation.py" \
--batch=jskdnkajsnd-test-10 --region=us-west1 --subnet="<uri>" \
-- --run_timestamp "2020-01-01" --temp_bucket gs://pipeline/spark_temp_bucket/hourly/ --bucket pipeline --pipeline hourly
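
The difference comes down to how the arguments reach the script. In the gcloud invocation the shell splits everything after -- on whitespace, so --run_timestamp and its value arrive as two separate argv tokens. With DataprocCreateBatchOperator, each element of pyspark_batch.args is handed to the driver as a single argv entry, so '--run_timestamp "{{ ts }}"' arrives as one literal token, embedded quotes and space included, and an argparse-based script rejects it. A minimal local reproduction, assuming pipeline_feat_creation.py parses its flags with argparse (which the "unrecognized arguments" wording suggests; the parser below is a guess, only the flag names are taken from the question):

import argparse

# Hypothetical stand-in for the flag parsing inside pipeline_feat_creation.py.
parser = argparse.ArgumentParser()
parser.add_argument("--run_timestamp")
parser.add_argument("--temp_bucket")
parser.add_argument("--bucket")
parser.add_argument("--pipeline")

# Each element of the operator's args list reaches the script as ONE argv token,
# so the flag name and its value are never split apart:
bad_argv = ['--run_timestamp "2022-06-17T13:22:51.800834+00:00"',
            '--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"']
# parser.parse_args(bad_argv)  # -> error: unrecognized arguments: --run_timestamp "..."

# The gcloud run works because the shell splits on spaces first, so the same
# flags arrive as separate tokens:
good_argv = ["--run_timestamp", "2020-01-01",
             "--temp_bucket", "gs://pipeline/spark_temp_bucket/hourly/"]
print(parser.parse_args(good_argv))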

Comments (1)

尛丟丟 2025-02-15 20:43:16

The error I was running into was caused by not adding an = after each parameter name; I've also eliminated the " encapsulation around each value. This is how the args are now set up:

"args": [
    '--run_timestamp={{ ts }}',
    '--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/',
    '--bucket=pipeline',
    '--pipeline=hourly'
]
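
This works because each list element is now a single, self-contained --name=value token, and argparse itself splits on the =, so no shell-style word splitting is needed. The {{ ts }} macro should still be rendered here, since batch is listed among DataprocCreateBatchOperator's templated fields in recent versions of the Google provider. A quick sanity check against the same hypothetical parser as in the sketch above:

import argparse

# Same hypothetical parser as above; only the flag names come from the question.
parser = argparse.ArgumentParser()
for flag in ("--run_timestamp", "--temp_bucket", "--bucket", "--pipeline"):
    parser.add_argument(flag)

# Each "--name=value" element is one argv token; argparse splits it on "=".
fixed_argv = [
    "--run_timestamp=2022-06-17T13:22:51.800834+00:00",
    "--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/",
    "--bucket=pipeline",
    "--pipeline=hourly",
]
print(parser.parse_args(fixed_argv))  # all four flags parse cleanly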