Assume we have a parquet file (or any other file) P which has some columns and rows. File P has a column named A which contains only 0 or 1. Moreover, rows with column A as 0 are only 1% of this file. We want to read the rows whose column A is 0.
The simplest way to do this is to read it all and then use a where clause to filter on A. This costs too much because we actually want only 1% of file P. We also can't write P into two files (one with A as 0, one with A as 1) before reading it. This looks impossible: if you don't read it into memory, how can you filter it?
If we sort file P by column A before writing it to disk, the rows with A as 0 end up in the first part of file P. In this situation, can Spark read only the rows whose A is 0?
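The setup in the question can be sketched in plain Python with a CSV stand-in for P (all names here are illustrative, nothing Spark-specific): once the file is sorted by A, a sequential reader can stop at the first A == 1 row instead of scanning everything.

```python
import csv
import io

# Build a stand-in for file P: 1% of the rows have A == 0, sorted by A.
rows = [{"A": 0} for _ in range(10)] + [{"A": 1} for _ in range(990)]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["A"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)

# Because the file is sorted by A, the reader can stop at the first
# A == 1 row instead of scanning all 1000 rows.
wanted = []
scanned = 0
for row in csv.DictReader(buf):
    scanned += 1
    if row["A"] != "0":
        break
    wanted.append(row)

print(len(wanted), scanned)  # 10 matching rows found after scanning only 11
```

This is the intuition behind the question: sorting pushes all the interesting rows to the front, so only a small prefix of the file needs to be read.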
This is going to depend on the file format you're talking about, and on the way the file has been partitioned. Let's discuss parquet files:
Filtering on a partition column
If your file has been written partitioned on column A (for instance with Spark's partitionBy("A") when writing), this is the best case scenario. Your file will have been written in a way that
spark.read.parquet(filename).filter(col("A") === 0)
will only read in the wanted data. The reason why this is possible is because the parquet output has a subdirectory for each value of the A column (e.g. A=0/ and A=1/ under the output path). So it's perfectly possible to not read in the directory where A == 1.

Filtering on a non-partition column
This is where the file format you're using has an impact. In the case of parquet files, there is some filtering that is also pushed down to reading the file.
Parquet files are chopped up in row groups. For each row group, parquet files keep some metadata (including the min/max of that row group). This allows for skipping of row groups (imagine you want to read in only values where A == 0 and the min value of a row group is 1: you can skip that whole row group). For CSV files, for example, this second kind of filtering is not possible.
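The row-group skipping described above can be sketched in plain Python. This is a simulation of the idea, not real parquet internals; every function and field name here is made up for illustration.

```python
# Sketch: how per-row-group min/max statistics let a reader skip
# entire row groups without touching their rows.

def split_into_row_groups(rows, group_size):
    """Chop rows into fixed-size 'row groups', recording min/max of
    column A for each group, the way a parquet writer records statistics."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        a_values = [r["A"] for r in chunk]
        groups.append({"rows": chunk,
                       "min_A": min(a_values),
                       "max_A": max(a_values)})
    return groups

def read_where_a_equals(groups, target):
    """Skip every row group whose [min_A, max_A] range cannot contain
    target; only the remaining groups are actually scanned."""
    matches = []
    for g in groups:
        if target < g["min_A"] or target > g["max_A"]:
            continue  # whole row group skipped; its rows are never read
        matches.extend(r for r in g["rows"] if r["A"] == target)
    return matches

# Data sorted by A: the few A == 0 rows all land in the first row group.
rows = [{"A": 0} for _ in range(2)] + [{"A": 1} for _ in range(198)]
groups = split_into_row_groups(rows, group_size=50)
matches = read_where_a_equals(groups, target=0)
skipped = sum(1 for g in groups if g["min_A"] > 0)
print(len(matches), skipped)  # 2 matches; 3 of the 4 row groups skipped
```

With the data sorted by A before writing, all the A == 0 rows fall into a single row group, so the statistics alone rule out the other groups. That is exactly why sorting before writing helps in the question's scenario.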