Assume we have a parquet file (or any other file) P which has some columns and rows. File P has a column named A which contains only 0 or 1. Moreover, rows with column A as 0 are only 1% of this file. We want to read the rows whose column A is 0.
The simplest way to do this is to read it all and then use a where clause to filter on A. This costs too much because we actually want only 1% of file P. We also can't write P into two files (one with A as 0, one with A as 1) before reading it. This looks impossible: if you don't read it into memory, how can you filter it?
If we sort file P by column A before writing it to disk, the rows with A as 0 end up in the first part of file P. In this situation, can Spark read only the rows whose A is 0?
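The setup in the question can be sketched in plain Python with a CSV stand-in for P (all names here are illustrative, nothing Spark-specific): once the file is sorted by A, a sequential reader can stop at the first A == 1 row instead of scanning everything.

```python
import csv
import io

# Build a stand-in for file P: 1% of the rows have A == 0, sorted by A.
rows = [{"A": 0} for _ in range(10)] + [{"A": 1} for _ in range(990)]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["A"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)

# Because the file is sorted by A, the reader can stop at the first
# A == 1 row instead of scanning all 1000 rows.
wanted = []
scanned = 0
for row in csv.DictReader(buf):
    scanned += 1
    if row["A"] != "0":
        break
    wanted.append(row)

print(len(wanted), scanned)  # 10 matching rows found after scanning only 11
```

This is the intuition behind the question: sorting pushes all the interesting rows to the front, so only a small prefix of the file needs to be read.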
This is going to depend on the file format you're talking about, and on the way the file has been partitioned. Let's discuss parquet files:
Filtering on a partition column
If your file has been written partitioned on column A (for instance with Spark's partitionBy("A") when writing), this is the best case scenario. Your file will have been written in a way that
spark.read.parquet(filename).filter(col("A") === 0)
will only read in the wanted data. The reason why this is possible is because the parquet output has a subdirectory for each value of the A column (e.g. A=0/ and A=1/ under the output path). So it's perfectly possible to not read in the directory where A == 1.

Filtering on a non-partition column
This is where the file format you're using has an impact. In the case of parquet files, there is some filtering that is also pushed down to reading the file.
Parquet files are chopped up in row groups. For each row group, parquet files keep some metadata (including the min/max of that row group). This allows for skipping of row groups (imagine you want to read in only values where A == 0 and the min value of a row group is 1: you can skip that whole row group). For CSV files, for example, this second kind of filtering is not possible.
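The row-group skipping described above can be sketched in plain Python. This is a simulation of the idea, not real parquet internals; every function and field name here is made up for illustration.

```python
# Sketch: how per-row-group min/max statistics let a reader skip
# entire row groups without touching their rows.

def split_into_row_groups(rows, group_size):
    """Chop rows into fixed-size 'row groups', recording min/max of
    column A for each group, the way a parquet writer records statistics."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        a_values = [r["A"] for r in chunk]
        groups.append({"rows": chunk,
                       "min_A": min(a_values),
                       "max_A": max(a_values)})
    return groups

def read_where_a_equals(groups, target):
    """Skip every row group whose [min_A, max_A] range cannot contain
    target; only the remaining groups are actually scanned."""
    matches = []
    for g in groups:
        if target < g["min_A"] or target > g["max_A"]:
            continue  # whole row group skipped; its rows are never read
        matches.extend(r for r in g["rows"] if r["A"] == target)
    return matches

# Data sorted by A: the few A == 0 rows all land in the first row group.
rows = [{"A": 0} for _ in range(2)] + [{"A": 1} for _ in range(198)]
groups = split_into_row_groups(rows, group_size=50)
matches = read_where_a_equals(groups, target=0)
skipped = sum(1 for g in groups if g["min_A"] > 0)
print(len(matches), skipped)  # 2 matches; 3 of the 4 row groups skipped
```

With the data sorted by A before writing, all the A == 0 rows fall into a single row group, so the statistics alone rule out the other groups. That is exactly why sorting before writing helps in the question's scenario.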