Why does a Parquet file produce multiple part files in PySpark?
After some extensive research, I have figured out that
Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
However, I am unable to understand why Parquet writes multiple files when I run df.write.parquet("/tmp/output/my_parquet.parquet"),
despite it supporting flexible compression options and efficient encoding.
Is this directly related to parallel processing or similar concepts?
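For context, here is a minimal sketch of roughly what I am running (spark.range is only a stand-in for my actual DataFrame):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-parts-demo").getOrCreate()

    # Stand-in DataFrame; any DataFrame behaves the same way on write.
    df = spark.range(0, 1_000_000)

    # The number of partitions determines how many part files the write produces.
    print(df.rdd.getNumPartitions())   # e.g. 8 on an 8-core local machine

    # Creates a directory /tmp/output/my_parquet.parquet containing one
    # part-*.snappy.parquet file per partition plus a _SUCCESS marker.
    df.write.mode("overwrite").parquet("/tmp/output/my_parquet.parquet")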
Answers (2)
Lots of frameworks make use of this multi-file layout feature of the Parquet format, so I’d say it’s a standard option that is part of the Parquet specification, and Spark uses it by default.
This does have benefits for parallel processing, but also for other use cases, such as processing (in parallel or in series) on cloud or networked file systems, where data transfer time may be a significant portion of total IO. In these cases, the Parquet "Hive"-style layout, which uses small metadata files providing statistics and information about which data files to read, offers significant performance benefits when reading small subsets of the data. This is true whether a single-threaded application is reading a subset of the data or each worker in a parallel process is reading a portion of the whole.
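As an illustrative sketch of that layout (the year and country columns and the /tmp/output/events path are hypothetical), a partitioned write plus a filtered read might look like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-layout-demo").getOrCreate()

    # Hypothetical data with "year" and "country" columns, purely for illustration.
    df = spark.createDataFrame(
        [(2022, "DE", 1.0), (2023, "DE", 2.0), (2023, "FR", 3.0)],
        ["year", "country", "value"],
    )

    # Hive-style partitioned layout: one sub-directory per (year, country)
    # value, each holding its own part files.
    df.write.mode("overwrite").partitionBy("year", "country").parquet("/tmp/output/events")

    # A filtered read only opens files under the matching partition directories,
    # and row-group statistics inside each file allow further skipping.
    subset = spark.read.parquet("/tmp/output/events").filter("year = 2023 AND country = 'DE'")
    subset.show()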
It's not just Parquet; it's a Spark feature. To avoid network IO, Spark writes each shuffle partition as a separate 'part-...' file on disk, and each file, as you said, will have compression and efficient encoding by default.
So yes, it is directly related to parallel processing.
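As a rough sketch of that relationship (output paths are illustrative), the number of part files follows the partition count at write time, and you can force a single file with coalesce(1) when the data is small:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("part-files-demo").getOrCreate()
    df = spark.range(0, 1_000_000)

    # Repartitioning before the write changes how many part files appear.
    df.repartition(4).write.mode("overwrite").parquet("/tmp/output/four_parts")

    # Collapsing to a single partition yields a single part file, but gives up
    # the parallel write (only sensible for small data).
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/output/one_part")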