GCP Dataproc - adding multiple packages (Kafka, MongoDB) when submitting a job does not work
I'm trying to add the Kafka and MongoDB packages while submitting a Dataproc PySpark job, but it is failing.
So far I've been using only the Kafka package and that works fine;
however, when I try to add the MongoDB package as well, it gives an error.
The command below works fine with only the Kafka package:
gcloud dataproc jobs submit pyspark main.py \
--cluster versa-structured-stream \
--properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2, spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true
I tried a few options to add both packages, but none of them work. For example:
--properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
--region us-east1 \
--py-files streams.zip,utils.zip
Error:
Traceback (most recent call last):
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 303, in <module>
sys.exit(main())
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 260, in main
df_stream = spark.readStream.format('kafka') \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 482, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
How do I make this work?
Thanks in advance!
1 Answer
In your --properties you have defined ^#^ as the delimiter. To use it properly, you need to change the , between properties to #, since , should only be used when defining multiple values within a single key. You also need to remove the spark: prefix from your properties. See the sample command below.
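Here is a sketch of the corrected submit command, reusing the cluster name, jars, and py-files from the question (adjust these values for your own environment):

gcloud dataproc jobs submit pyspark main.py \
--cluster versa-structured-stream \
--region us-east1 \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2 \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
--py-files streams.zip,utils.zip

The leading ^#^ tells gcloud to split the --properties value on # instead of ,, so the comma inside spark.jars.packages is preserved and both connector coordinates are kept as a single value. When the job config is inspected afterwards, spark.jars.packages should list both the Kafka and MongoDB connectors.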