Uploading a PySpark DataFrame to BigQuery as a Dataproc job

Posted on 2025-01-19 08:32:41


I'm trying to submit a PySpark job on a Dataproc cluster. The job uploads a DataFrame to BigQuery.
When I run it with "submit job" on the cluster, the job fails with an error. But when I provide this jar:
"gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar" in the jar files parameter of the job submission, the job runs successfully.

What I want is a way to avoid providing this jar at run time, so that the job runs when I give only the location of the .py file. How can I do that? Is it possible to specify this jar within the .py file itself?

I tried the approach below to provide the jar in the .py file itself, but it doesn't seem to work.

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('yarn') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .appName('df-to-bq-sample').enableHiveSupport().getOrCreate()

Can anyone suggest a way to do this? I do not want to use the CLI for it.
Thank you!


小姐丶请自重 2025-01-26 08:32:41


First of all, the jar you mention is required when reading from and writing to BigQuery. If you don't want to add it at job submission, you can install the BigQuery connector jar at cluster creation using the connectors init action, like this:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.24.2
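
Once the cluster has been created with the connector installed, the .py file should no longer need any spark.jars setting. Below is a minimal sketch of what the job file might look like under that assumption. The table name "my_dataset.my_table", the bucket "my-temp-bucket", and the helper names bigquery_write_options and write_to_bigquery are hypothetical and only for illustration; "table" and "temporaryGcsBucket" are options of the Spark BigQuery connector.

```python
def bigquery_write_options(table, temp_bucket):
    """Build the options passed to the connector's DataFrame writer.

    `table` is "dataset.table" or "project.dataset.table"; `temp_bucket`
    is the name of a GCS bucket the connector can use for staging data.
    """
    if not table or "." not in table:
        raise ValueError("table must look like 'dataset.table'")
    return {"table": table, "temporaryGcsBucket": temp_bucket}


def write_to_bigquery(df, table, temp_bucket, mode="append"):
    """Write a Spark DataFrame to BigQuery through the connector."""
    options = bigquery_write_options(table, temp_bucket)
    df.write.format("bigquery").options(**options).mode(mode).save()


def main():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    # No spark.jars setting: the connector jar is assumed to be on the
    # cluster already (installed by the init action at cluster creation).
    spark = SparkSession.builder.appName("df-to-bq-sample").getOrCreate()

    # Hypothetical sample data and destination, for illustration only.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    write_to_bigquery(df, "my_dataset.my_table", "my-temp-bucket")
```

To use this as the job file, call main() at the bottom of the script and submit only the .py location. The bucket must already exist and be writable by the cluster's service account.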
