Airflow Dataproc Serverless create batch operator not using Python arguments
I'm trying to set up a Dataproc Serverless batch job from Google Cloud Composer using the DataprocCreateBatchOperator operator, which takes some arguments that affect the underlying Python code. However, I'm running into the following error:
error: unrecognized arguments: --run_timestamp "2022-06-17T13:22:51.800834+00:00" --temp_bucket "gs://pipeline/spark_temp_bucket/hourly/" --bucket "pipeline" --pipeline "hourly"
This is how my operator is set up:
import random
import string

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
    task_id="hourly_pipeline",
    project_id="dev",
    region="us-west1",
    # random 40-character batch id
    batch_id="".join(random.choice(string.ascii_lowercase + string.digits + "-") for i in range(40)),
    batch={
        "environment_config": {
            "execution_config": {
                "service_account": "<service_account>",
                "subnetwork_uri": "<uri>",
            }
        },
        "pyspark_batch": {
            "main_python_file_uri": "gs://pipeline/code/pipeline_feat_creation.py",
            "args": [
                '--run_timestamp "{{ ts }}"',
                '--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"',
                '--bucket "pipeline"',
                '--pipeline "hourly"',
            ],
            "jar_file_uris": [
                "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.0.jar"
            ],
        },
    },
)
Regarding the args array: I tried setting the parameters both with and without encapsulating them in "". I've also already done a gcloud submit that worked, like so:
gcloud dataproc batches submit pyspark "gs://pipeline/code/pipeline_feat_creation.py" \
--batch=jskdnkajsnd-test-10 --region=us-west1 --subnet="<uri>" \
-- --run_timestamp "2020-01-01" --temp_bucket gs://pipeline/spark_temp_bucket/hourly/ --bucket pipeline --pipeline hourly
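As a side note on what the error means: "error: unrecognized arguments" is the message Python's argparse prints when leftover argv tokens match no defined option, so the receiving script presumably parses these flags roughly as in the sketch below (hypothetical; the actual pipeline_feat_creation.py is not shown here):

    import argparse

    # Hypothetical parser for pipeline_feat_creation.py (the real script is not shown above).
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_timestamp")
    parser.add_argument("--temp_bucket")
    parser.add_argument("--bucket")
    parser.add_argument("--pipeline")
    args = parser.parse_args()

Each element of the operator's args list reaches the script as a single argv token, so '--run_timestamp "{{ ts }}"' arrives as one literal string, quotes and all, which matches no option name and ends up in the "unrecognized arguments" list. The gcloud invocation works because the shell splits --run_timestamp "2020-01-01" into two separate tokens before they reach the script.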
Comments (1)
The error I was running into was that I wasn't adding a = after each parameter; I've also eliminated the " encapsulation around each parameter. This is how the args are now set up:
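The corrected operator code was not included in the answer, but based on that description the args array would presumably now look something like this (a sketch; values taken from the question above):

    "args": [
        "--run_timestamp={{ ts }}",
        "--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/",
        "--bucket=pipeline",
        "--pipeline=hourly"
    ]

With the key=value form, the flag and its value stay in one argv token that argparse splits on the =, so no shell-style quoting is needed.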