Pyspark not creating SparkContext (Yarn). Gateway failure or blocked network traffic?
Here is some context about my installation of the pyspark binaries.
In my company, we use Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it is an image built from a specific Dockerfile, and that Dockerfile contains the installation and configuration of the CDH binaries.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to deploy very small resources for the webapps.
What I have done is tar.gz all binaries and config from `/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072` and `/var/lib/cdsw/client-config/`, then untar them in a container or in a WSL2 instance.
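For reference, here is a minimal sketch of that copy step (the archive file names are placeholders I chose, not something from the cluster):

```bash
# On the CDSW host: archive the parcel and the client config.
tar -czf cdh-parcel.tar.gz -C /opt/cloudera/parcels CDH-6.3.4-1.cdh6.3.4.p4484.8795072
tar -czf client-config.tar.gz -C /var/lib/cdsw client-config

# In the target container / WSL2 instance: extract under $HOME instead of /.
mkdir -p "$HOME/opt/cloudera/parcels" "$HOME/etc"
tar -xzf cdh-parcel.tar.gz -C "$HOME/opt/cloudera/parcels"
tar -xzf client-config.tar.gz -C "$HOME/etc"
```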
Instead of unpacking everything into `/var/` or `/opt/` like I should, I've put everything under `$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/*` and `$HOME/etc/client-config/*`. Why did I do this? Because I might want to use a mounted volume in my Kubernetes cluster and share the binaries between containers.
I've run `sed` over all the configuration files to adapt the paths (a sketch of the kind of rewrite is shown after this list):
- spark-env.sh
- topology.py
- Any *.txt, *.sh, *.py
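A minimal sketch of that rewrite, assuming the files contain the literal prefixes `/opt/cloudera` and `/var/lib/cdsw/client-config` (the patterns are illustrative; adjust them to the real configs):

```bash
# Rewrite the absolute prefixes to their $HOME-relative equivalents in every
# text-like config file under the extracted trees.
find "$HOME/opt/cloudera" "$HOME/etc/client-config" \
     -type f \( -name '*.sh' -o -name '*.py' -o -name '*.txt' \) \
     -exec sed -i \
       -e "s|/opt/cloudera|$HOME/opt/cloudera|g" \
       -e "s|/var/lib/cdsw/client-config|$HOME/etc/client-config|g" {} +
```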
So I managed to run `beeline`, `hadoop`, `hdfs` and `hbase` by pointing them at the `hadoop-conf` folder. I can use `pyspark`, but in local mode only. What I really want is to use `pyspark` with `yarn`.
So I set a bunch of env variables to make this work:
```bash
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
```
Anyway, all of the paths exist and work. And since I've run `sed` on all the config files, they resolve to the same paths as the exported ones.
I launch my pyspark binary like this:
```bash
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
```
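As an aside, a quick way to check whether the YARN submission path itself works, independently of pyspark, is the bundled SparkPi example (a sketch; the exact examples jar name may differ in the parcel):

```bash
# Submit the stock SparkPi example to YARN in client mode; if this fails in
# the same way, the problem is in the YARN/Hadoop client setup, not pyspark.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples*.jar 10
```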
FYI, it is using pyspark 2.4.0, and I've installed the Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the public certificate of the company, and I also generated a keytab for the Kerberos auth. Both of them work, since I can use `hdfs` with `HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf`.
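For completeness, this is the kind of check I mean (the principal and keytab names are placeholders):

```bash
# Obtain a Kerberos ticket from the generated keytab, then verify HDFS access.
kinit -kt "$HOME/user.keytab" user@CORP.REALM.NET
klist   # confirm a ticket was granted
HADOOP_CONF_DIR="$HOME/etc/client-config/hadoop-conf" hdfs dfs -ls /
```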
In verbose mode I can see all the details and configuration from Spark, and when I compare them with a CDSW session they are pretty much identical (modulo the modified paths), for example:
```
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
```
After a few seconds it fails to create a `SparkSession`:
```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN  Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN  MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN  SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
```
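(A side note on the first warning above: a StandbyException usually just means the HA client probed the standby master first and should fail over to the active one. To confirm which instance is active, something like this sketch may help; the service IDs are placeholders, the real ones live in the HA settings of hdfs-site.xml and yarn-site.xml:)

```bash
# Ask the HA admin tools which instance is currently active. The IDs
# "namenode1" / "rm1" are hypothetical; use the configured ones.
hdfs haadmin -getServiceState namenode1
yarn rmadmin -getServiceState rm1
```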
From what I understand, it fails for a reason I'm not sure about and then tries to fall back to another mode, which fails too. In the configuration file `spark-conf/yarn-conf/yarn-site.xml`, a ZooKeeper quorum is specified:
```xml
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
```
Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP, or a personal IP from a workstation)? As far as I can tell, the IP I'm working from is not on the whitelist, and at the moment I cannot ask for it to be added. How can I know for sure that I'm looking in the right direction?
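One way to narrow that down is a plain TCP reachability check against the addresses from the client config (a sketch, assuming `nc` is available; the hosts and port are the ones from yarn-site.xml above):

```bash
# Check raw TCP reachability to the ZooKeeper quorum. A timeout suggests
# filtered traffic; "connection refused" or a success means the route is open.
for host in corporate.machine.node1.name.net \
            corporate.machine.node2.name.net \
            corporate.machine.node3.name.net; do
  nc -vz -w 5 "$host" 9999
done
```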
Edit 1:
As said in the comments, the URI of the `pyspark.zip` was wrong. I've modified my `PYSPARK_ARCHIVES_PATH` to point to the real location of `pyspark.zip`:
```bash
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
```
Now I get an `UnknownHostException` error:
```
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...
```
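To check the name-resolution side of that error, a minimal sketch (the host below is a placeholder standing in for the `<HOSTNAME>` from the stack trace):

```bash
# Can this environment resolve the cluster hostname at all?
HOST="corporate.machine.nodeX.name.net"   # hypothetical; substitute the real host
getent hosts "$HOST" || echo "cannot resolve $HOST"

# If the corporate DNS is unreachable from the container, a temporary
# /etc/hosts entry (with an IP taken from a machine that can resolve the
# name) is one way to confirm the diagnosis:
#   echo "10.0.0.42  $HOST" >> /etc/hosts
```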