GCP Dataproc - adding multiple packages (Kafka, MongoDB) on job submit does not work

I'm trying to add the Kafka and MongoDB packages when submitting a Dataproc PySpark job, but it is failing.
So far I've been using only the Kafka package and that works fine;
however, when I try to add the MongoDB package to the command below, it gives an error.

The command works fine with only the Kafka package:

gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream  \
  --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true

I tried a few options to add both packages, but none of them work. For example:

--properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --region us-east1 \
  --py-files streams.zip,utils.zip


Error:
Traceback (most recent call last):
  File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 303, in <module>
    sys.exit(main())
  File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 260, in main
    df_stream = spark.readStream.format('kafka') \
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 482, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
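
For reference, the streaming read in main.py that fails is roughly the following (a minimal sketch; the broker address and topic name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('versa-structured-stream').getOrCreate()

# Structured Streaming read from Kafka; this is the call that raises the
# AnalysisException when spark-sql-kafka is not on the classpath.
df_stream = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', '<broker-host>:9092') \
    .option('subscribe', '<topic-name>') \
    .load()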

How do I make this work?

TIA!

Answer from 长梦不多时 (2025-02-08):

In your --properties you have defined ^#^ as the delimiter. To use the delimiter properly, you need to change the , to # as the separator between properties; you only use , when defining multiple values within a single key. You also need to remove the spark: prefix from your properties. See the sample command below:

gcloud dataproc jobs submit pyspark main.py \
  --cluster=cluster-3069  \
  --region=us-central1 \
  --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2

When the job configuration is inspected, this is the result:

(screenshot of the resulting job configuration)
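
Once both coordinates resolve, the kafka and mongo data sources should both be usable from PySpark. A quick sanity check could look like this (a sketch only; the broker, URI, database, and collection names are placeholders):

# Kafka streaming source, provided by spark-sql-kafka-0-10
kafka_df = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', '<broker-host>:9092') \
    .option('subscribe', '<topic-name>') \
    .load()

# MongoDB batch read, provided by mongo-spark-connector (3.x registers the 'mongo' short name)
mongo_df = spark.read.format('mongo') \
    .option('uri', 'mongodb://<host>:27017/<database>.<collection>') \
    .load()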
