Pyspark not creating SparkContext (Yarn). Gateway failure or blocked network traffic?
Here is some context about my installation of the pyspark binaries.
In my company, we use Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it is an image built from a specific Dockerfile, and that Dockerfile contains the installation and configuration of the CDH binaries.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to deploy very small resources for the webapps.
What I have done is tar.gz all binaries and config from `/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072` and `/var/lib/cdsw/client-config/`, then untar them in a container or in a WSL2 instance.
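For reference, here is a minimal sketch of that copy step (the archive file names are placeholders I chose, not something from the cluster):

```bash
# On the CDSW host: archive the parcel and the client config.
tar -czf cdh-parcel.tar.gz -C /opt/cloudera/parcels CDH-6.3.4-1.cdh6.3.4.p4484.8795072
tar -czf client-config.tar.gz -C /var/lib/cdsw client-config

# In the target container / WSL2 instance: extract under $HOME instead of /.
mkdir -p "$HOME/opt/cloudera/parcels" "$HOME/etc"
tar -xzf cdh-parcel.tar.gz -C "$HOME/opt/cloudera/parcels"
tar -xzf client-config.tar.gz -C "$HOME/etc"
```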
Instead of unpacking everything into `/var/` or `/opt/` like I should, I've put everything under `$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/*` and `$HOME/etc/client-config/*`. Why did I do this? Because I might want to use a mounted volume in my Kubernetes cluster and share the binaries between containers.
I've run `sed` over all the configuration files to adapt the paths (a sketch of the kind of rewrite is shown after this list):
- spark-env.sh
- topology.py
- Any *.txt, *.sh, *.py
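A minimal sketch of that rewrite, assuming the files contain the literal prefixes `/opt/cloudera` and `/var/lib/cdsw/client-config` (the patterns are illustrative; adjust them to the real configs):

```bash
# Rewrite the absolute prefixes to their $HOME-relative equivalents in every
# text-like config file under the extracted trees.
find "$HOME/opt/cloudera" "$HOME/etc/client-config" \
     -type f \( -name '*.sh' -o -name '*.py' -o -name '*.txt' \) \
     -exec sed -i \
       -e "s|/opt/cloudera|$HOME/opt/cloudera|g" \
       -e "s|/var/lib/cdsw/client-config|$HOME/etc/client-config|g" {} +
```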
So I managed to run `beeline`, `hadoop`, `hdfs` and `hbase` by pointing them at the `hadoop-conf` folder. I can use `pyspark`, but in local mode only. What I really want is to use `pyspark` with `yarn`.
So I set a bunch of env variables to make this work:
```bash
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
```
Anyway, all of the paths exist and work. And since I've run `sed` on all the config files, they resolve to the same paths as the exported ones.
I launch my pyspark binary like this:
```bash
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
```
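As an aside, a quick way to check whether the YARN submission path itself works, independently of pyspark, is the bundled SparkPi example (a sketch; the exact examples jar name may differ in the parcel):

```bash
# Submit the stock SparkPi example to YARN in client mode; if this fails in
# the same way, the problem is in the YARN/Hadoop client setup, not pyspark.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples*.jar 10
```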
FYI, it is using pyspark 2.4.0, and I've installed the Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the public certificate of the company, and I also generated a keytab for the Kerberos auth. Both of them work, since I can use `hdfs` with `HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf`.
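For completeness, this is the kind of check I mean (the principal and keytab names are placeholders):

```bash
# Obtain a Kerberos ticket from the generated keytab, then verify HDFS access.
kinit -kt "$HOME/user.keytab" user@CORP.REALM.NET
klist   # confirm a ticket was granted
HADOOP_CONF_DIR="$HOME/etc/client-config/hadoop-conf" hdfs dfs -ls /
```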
In verbose mode I can see all the details and configuration from Spark, and when I compare them with a CDSW session they are pretty much identical (modulo the modified paths), for example:
```
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
```
After a few seconds it fails to create a `SparkSession`:
```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN  Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN  MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN  SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
```
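(A side note on the first warning above: a StandbyException usually just means the HA client probed the standby master first and should fail over to the active one. To confirm which instance is active, something like this sketch may help; the service IDs are placeholders, the real ones live in the HA settings of hdfs-site.xml and yarn-site.xml:)

```bash
# Ask the HA admin tools which instance is currently active. The IDs
# "namenode1" / "rm1" are hypothetical; use the configured ones.
hdfs haadmin -getServiceState namenode1
yarn rmadmin -getServiceState rm1
```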
From what I understand, it fails for a reason I'm not sure about and then tries to fall back to another mode, which fails too. In the configuration file `spark-conf/yarn-conf/yarn-site.xml`, a ZooKeeper quorum is specified:
```xml
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
```
Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP, or a personal IP from a workstation)? As far as I can tell, the IP I'm working from is not on the whitelist, and at the moment I cannot ask for it to be added. How can I know for sure that I'm looking in the right direction?
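One way to narrow that down is a plain TCP reachability check against the addresses from the client config (a sketch, assuming `nc` is available; the hosts and port are the ones from yarn-site.xml above):

```bash
# Check raw TCP reachability to the ZooKeeper quorum. A timeout suggests
# filtered traffic; "connection refused" or a success means the route is open.
for host in corporate.machine.node1.name.net \
            corporate.machine.node2.name.net \
            corporate.machine.node3.name.net; do
  nc -vz -w 5 "$host" 9999
done
```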
Edit 1:
As said in the comments, the URI of the `pyspark.zip` was wrong. I've modified my `PYSPARK_ARCHIVES_PATH` to point to the real location of `pyspark.zip`:
```bash
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
```
Now I get an `UnknownHostException` error:
```
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...
```
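To check the name-resolution side of that error, a minimal sketch (the host below is a placeholder standing in for the `<HOSTNAME>` from the stack trace):

```bash
# Can this environment resolve the cluster hostname at all?
HOST="corporate.machine.nodeX.name.net"   # hypothetical; substitute the real host
getent hosts "$HOST" || echo "cannot resolve $HOST"

# If the corporate DNS is unreachable from the container, a temporary
# /etc/hosts entry (with an IP taken from a machine that can resolve the
# name) is one way to confirm the diagnosis:
#   echo "10.0.0.42  $HOST" >> /etc/hosts
```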