Spark load data from s3a on a custom endpoint stalls

Posted 2025-01-09 16:57:21


I am trying to do a simple operation on a Spark cluster by running the following code in pyspark --master yarn:

op = spark.read.format("csv")
op = op.options(header=True, sep=";")
# This is actually a custom S3 endpoint on an AWS Snowball Edge device
op = op.load("s3a://some-bucket/some/path/file_*.txt")

No errors show, but the operation never completes. If I pass a nonexistent path in S3, it throws an error saying the path does not exist, and reading from HDFS works fine. So it seems to be a communication issue with S3 when reading data.

Here are the details of my stack:

spark: https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
awscli: https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip
hadoop: https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
hive: https://dlcdn.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
hadoop_aws: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
aws_java_sdk_bundle: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar

My core-site.xml:

<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://34.223.14.233:9000</value>
  </property>

  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.16.100.1:8080</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <value>foo</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>bar</value>
  </property>

  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>

  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>

</configuration>
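
For reference, the same s3a settings can also be handed to Spark directly as spark.hadoop.* properties when the session is built, which can help confirm they are actually being picked up. A minimal sketch, using the placeholder values from the XML above (not real credentials):

# Sketch: supply the s3a settings from core-site.xml as spark.hadoop.* properties.
# Endpoint and credentials below are the placeholders from the XML above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-endpoint-check")
    .config("spark.hadoop.fs.s3a.endpoint", "http://172.16.100.1:8080")
    .config("spark.hadoop.fs.s3a.access.key", "foo")
    .config("spark.hadoop.fs.s3a.secret.key", "bar")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Same read as in the question; .show() is an action, so it actually pulls data.
df = spark.read.options(header=True, sep=";").csv("s3a://some-bucket/some/path/file_*.txt")
df.show(5)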

Any ideas on troubleshooting this issue? Thank you so much!


Comments (1)

客…行舟 2025-01-16 16:57:21


I ended up here when investigating a similar problem. I also had s3a on a custom endpoint stalling (i.e. freezing or hanging). However, my setup is different – I set HadoopConf in code instead of a configuration XML.

The order of the config-setting statements in code matters: fs.s3a.endpoint has to be set first, and only after that can fs.s3a.access.key and fs.s3a.secret.key be set. What led me to this solution was that I logged all Hadoop conf values and noticed that fs.s3a.endpoint was empty.
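
For illustration, a minimal PySpark sketch of the ordering described above (endpoint and credentials are the placeholder values from the question, not real ones):

# Sketch: set the s3a options on the Hadoop configuration in code,
# with fs.s3a.endpoint set before the credentials, as described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-conf-order").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

hconf.set("fs.s3a.endpoint", "http://172.16.100.1:8080")  # endpoint first
hconf.set("fs.s3a.access.key", "foo")                     # then the credentials
hconf.set("fs.s3a.secret.key", "bar")
hconf.set("fs.s3a.connection.ssl.enabled", "false")

# Reading a value back is a quick way to spot an fs.s3a.endpoint that ended up empty
# (avoid printing real credentials).
print("fs.s3a.endpoint =", hconf.get("fs.s3a.endpoint"))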
