Spark sc.binaryFiles() partitioning of small files on YARN
Using the sc.binaryFiles() function in Spark 2.3.0 on a Hortonworks 2.6.5 server, I noticed behavior that I cannot explain regarding the default partitioning in a YARN-managed cluster. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val sc = new SparkContext("yarn", "ReadTestYarn")

  // Same input files, read two different ways
  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")

  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB in size, 43 MB in total. In the case of RDD1, the number of resulting partitions is understandable, and the calculation method is well explained in the following post and article:
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
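To make the RDD1 count concrete, here is my understanding of the split arithmetic from the links above. This is a rough sketch of Hadoop FileInputFormat's logic, and the sizes are approximations of my input, not exact values:

// Sketch of Hadoop FileInputFormat's split arithmetic for textFile(),
// following the post/article above. Sizes approximate my input files.
object TextFileSplitMath extends App {
  val numFiles      = 10
  val fileSize      = 4500000L              // ~4-5 MB per file (approximate)
  val totalSize     = numFiles * fileSize   // ~43 MB total
  val minPartitions = 2                     // sc.defaultMinPartitions = min(defaultParallelism, 2)
  val blockSize     = 128L * 1024 * 1024    // HDFS default block size

  // computeSplitSize = max(minSize, min(goalSize, blockSize)), with minSize = 1
  val goalSize  = totalSize / minPartitions // ~21.5 MB
  val splitSize = math.max(1L, math.min(goalSize, blockSize))

  // A split never spans files; each ~4-5 MB file fits in one split,
  // so every file becomes its own partition: 10 in total.
  val partitions = (1 to numFiles).map(_ => math.ceil(fileSize.toDouble / splitSize).toInt).sum
  println(s"expected textFile partitions: $partitions")  // 10
}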
But with RDD2, where the binaryFiles() function is used and the master URL passed to Spark is "yarn", only one partition is created, and I don't understand why.
@Mark Rajcok gave some explanation in the post below, but the link to the commit changes there no longer works. Could someone please provide a detailed explanation of why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
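For reference, forcing more partitions explicitly is the obvious workaround; below is a minimal sketch reusing sc and inputRDD2 from the sample above. Both calls are standard Spark RDD APIs, but the sketch is illustrative only, since my question is about the default behavior, not the workaround:

// Illustrative workaround sketch. Per the Spark API docs, minPartitions
// is "a suggestion value of the minimal splitting number for input data",
// so it is only a hint and may be ignored by the input format.
val hinted = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
println("With minPartitions hint: " + hinted.getNumPartitions)

// repartition() guarantees the partition count, at the cost of a shuffle.
val forced = inputRDD2.repartition(10)
println("After repartition(10): " + forced.getNumPartitions)  // 10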