Spark sc.binaryFiles() partitioning of small files on YARN
Using the sc.binaryFiles() function in Spark 2.3.0 on a Hortonworks 2.6.5 server, I noticed behavior that I cannot explain regarding the default partitioning in a YARN-managed cluster. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val sc = new SparkContext("yarn", "ReadTestYarn")

  // Same input files, read two different ways
  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")

  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB in size, 43 MB in total. In the case of RDD1, the number of resulting partitions is understandable, and the calculation method is well explained in the following post and article:
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
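To make the RDD1 count concrete, here is my understanding of the split arithmetic from the links above. This is a rough sketch of Hadoop FileInputFormat's logic, and the sizes are approximations of my input, not exact values:

// Sketch of Hadoop FileInputFormat's split arithmetic for textFile(),
// following the post/article above. Sizes approximate my input files.
object TextFileSplitMath extends App {
  val numFiles      = 10
  val fileSize      = 4500000L              // ~4-5 MB per file (approximate)
  val totalSize     = numFiles * fileSize   // ~43 MB total
  val minPartitions = 2                     // sc.defaultMinPartitions = min(defaultParallelism, 2)
  val blockSize     = 128L * 1024 * 1024    // HDFS default block size

  // computeSplitSize = max(minSize, min(goalSize, blockSize)), with minSize = 1
  val goalSize  = totalSize / minPartitions // ~21.5 MB
  val splitSize = math.max(1L, math.min(goalSize, blockSize))

  // A split never spans files; each ~4-5 MB file fits in one split,
  // so every file becomes its own partition: 10 in total.
  val partitions = (1 to numFiles).map(_ => math.ceil(fileSize.toDouble / splitSize).toInt).sum
  println(s"expected textFile partitions: $partitions")  // 10
}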
But with RDD2, where the binaryFiles() function is used and the master URL passed to Spark is "yarn", only one partition is created, and I don't understand why.
@Mark Rajcok gave some explanation in the post below, but the link to the commit changes there no longer works. Could someone please provide a detailed explanation of why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
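For reference, forcing more partitions explicitly is the obvious workaround; below is a minimal sketch reusing sc and inputRDD2 from the sample above. Both calls are standard Spark RDD APIs, but the sketch is illustrative only, since my question is about the default behavior, not the workaround:

// Illustrative workaround sketch. Per the Spark API docs, minPartitions
// is "a suggestion value of the minimal splitting number for input data",
// so it is only a hint and may be ignored by the input format.
val hinted = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
println("With minPartitions hint: " + hinted.getNumPartitions)

// repartition() guarantees the partition count, at the cost of a shuffle.
val forced = inputRDD2.repartition(10)
println("After repartition(10): " + forced.getNumPartitions)  // 10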