Spark loading data from s3a on a custom endpoint stalls
I am trying to do a simple operation on a Spark cluster by running the following code in pyspark --master yarn:
op = spark.read.format("csv")
op = op.options(header=True, sep=";")
# This is actually a custom S3 endpoint on an AWS Snowball Edge device
op = op.load("s3a://some-bucket/some/path/file_*.txt")
No errors show, but the operation never completes. Also, if I pass a nonexistent path in S3, it throws an error saying the path does not exist, and reading from HDFS works fine. So it seems to be a communication issue with S3 when reading the data.
Here are the details of my stack:
spark: https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
awscli: https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip
hadoop: https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
hive: https://dlcdn.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
hadoop_aws: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
aws_java_sdk_bundle: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar
My core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://34.223.14.233:9000</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.16.100.1:8080</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>foo</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>bar</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>
</configuration>
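For reference, a minimal sketch of the same s3a settings supplied per session through Spark's spark.hadoop.* config prefix (for example in a standalone script) instead of through core-site.xml; the endpoint and keys are the same placeholders as above:
from pyspark.sql import SparkSession

# Equivalent s3a configuration passed at session creation; Spark copies
# spark.hadoop.* options into the Hadoop Configuration used by the s3a connector.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://172.16.100.1:8080")
    .config("spark.hadoop.fs.s3a.access.key", "foo")
    .config("spark.hadoop.fs.s3a.secret.key", "bar")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)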
Any ideas on troubleshooting this issue? Thank you so much!
Comments (1)
I ended up here when investigating a similar problem. I also had s3a on a custom endpoint stalling (i.e. freezing or hanging). However, my setup is different: I set HadoopConf in code instead of in a configuration XML. The order of the config-setting statements in code is relevant: fs.s3a.endpoint has to be set first, and only after that can fs.s3a.access.key and fs.s3a.secret.key be set. What led me to this solution was that I logged all Hadoop conf values and noticed that fs.s3a.endpoint was empty.
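A minimal PySpark sketch of the ordering described above, assuming the interactive pyspark shell where spark already exists; the endpoint URL and keys are the placeholders from the question:
# Set the s3a options on the underlying Hadoop configuration,
# endpoint first, then the credentials.
# (_jsc is PySpark's internal handle to the JVM SparkContext.)
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://172.16.100.1:8080")  # must be set before the keys
hconf.set("fs.s3a.access.key", "foo")
hconf.set("fs.s3a.secret.key", "bar")
hconf.set("fs.s3a.connection.ssl.enabled", "false")

op = spark.read.format("csv").options(header=True, sep=";")
op = op.load("s3a://some-bucket/some/path/file_*.txt")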