ParquetDataset does not pick up partitions from the filters

Posted on 2025-01-14 23:50:28

I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I am using pyarrow to do this.

My S3 dataset is partitioned by client, year, month, and day using hive partitioning (client=, year= ...). I pass filters on client, year, month, and day to ParquetDataset, but it takes a long time to get the result.

Here is a code snippet:

from pyarrow import fs
from pyarrow import parquet as pq
import pathlib

s3_file_system = fs.S3FileSystem()
filters = [
    ("client_id", "=", 'client'),
    ("year", "=", year),
    ("month", "=", month),
    ("day", "=", day),
]
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path')),
    filesystem=s3_file_system,
    filters=filters,
)

I tried passing the partition directory directly as the s3_path:

dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path/client=/year=/month=/day=')),
    filesystem=s3_file_system,
    filters=filters,
)

and it worked perfectly. I don't understand why ParquetDataset scans all the files outside the partitions named in the filters.
