Spark load data from s3a on a custom endpoint stalls

Posted 2025-01-09 16:57:21


I am trying to do a simple operation on a Spark cluster by running the following code in pyspark --master yarn:

op = spark.read.format("csv")
op = op.options(header=True, sep=";")
# This is actually a custom S3 endpoint on an AWS Snowball Edge device
op = op.load("s3a://some-bucket/some/path/file_*.txt")

No errors show, but the operation never completes. If I pass a nonexistent path in S3, it throws an error saying the path does not exist, and reading from HDFS works fine. So it seems to be a communication issue with S3 when reading data.

Here are the details of my stack:

spark: https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
awscli: https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip
hadoop: https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
hive: https://dlcdn.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
hadoop_aws: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
aws_java_sdk_bundle: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar

My core-site.xml:

<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://34.223.14.233:9000</value>
  </property>

  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.16.100.1:8080</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <value>foo</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>bar</value>
  </property>

  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>

  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>

</configuration>
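
For reference, the same s3a settings can also be handed to Spark directly as spark.hadoop.* properties when the session is built, which can help confirm they are actually being picked up. A minimal sketch, using the placeholder values from the XML above (not real credentials):

# Sketch: supply the s3a settings from core-site.xml as spark.hadoop.* properties.
# Endpoint and credentials below are the placeholders from the XML above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-endpoint-check")
    .config("spark.hadoop.fs.s3a.endpoint", "http://172.16.100.1:8080")
    .config("spark.hadoop.fs.s3a.access.key", "foo")
    .config("spark.hadoop.fs.s3a.secret.key", "bar")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Same read as in the question; .show() is an action, so it actually pulls data.
df = spark.read.options(header=True, sep=";").csv("s3a://some-bucket/some/path/file_*.txt")
df.show(5)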

Any ideas on troubleshooting this issue? Thank you so much!


Comments (1)

客…行舟 2025-01-16 16:57:21


I ended up here when investigating a similar problem. I also had s3a on a custom endpoint stalling (i.e. freezing or hanging). However, my setup is different – I set HadoopConf in code instead of a configuration XML.

The order of the config-setting statements in code matters: fs.s3a.endpoint has to be set first, and only after that can fs.s3a.access.key and fs.s3a.secret.key be set. What led me to this solution was that I logged all Hadoop conf values and noticed that fs.s3a.endpoint was empty.
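
For illustration, a minimal PySpark sketch of the ordering described above (endpoint and credentials are the placeholder values from the question, not real ones):

# Sketch: set the s3a options on the Hadoop configuration in code,
# with fs.s3a.endpoint set before the credentials, as described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-conf-order").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

hconf.set("fs.s3a.endpoint", "http://172.16.100.1:8080")  # endpoint first
hconf.set("fs.s3a.access.key", "foo")                     # then the credentials
hconf.set("fs.s3a.secret.key", "bar")
hconf.set("fs.s3a.connection.ssl.enabled", "false")

# Reading a value back is a quick way to spot an fs.s3a.endpoint that ended up empty
# (avoid printing real credentials).
print("fs.s3a.endpoint =", hconf.get("fs.s3a.endpoint"))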
