ParquetDataset does not pick up partitions from the filters
I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I am doing this with pyarrow.
My S3 dataset is partitioned by client, year, month and day using hive partitioning (client=, year= ...). I pass ParquetDataset filters on client, year, month and day, but it takes a long time to return the result.
Here is a code snippet:
from pyarrow import fs
from pyarrow import parquet as pq
import pathlib

s3_file_system = fs.S3FileSystem()

# year, month and day are defined elsewhere
filters = [
    ("client_id", "=", 'client'),
    ("year", "=", year),
    ("month", "=", month),
    ("day", "=", day),
]

# Open the dataset at its root and rely on the filters to select the partition
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path')),
    filesystem=s3_file_system,
    filters=filters,
)
I also tried pointing s3_path directly at the partition:
dataset = pq.ParquetDataset(
    str(pathlib.Path('s3_path/client=/year=/month=/day=')),
    filesystem=s3_file_system,
    filters=filters,
)
and it worked perfectly. I don't understand why ParquetDataset scans all the files outside the partitions specified in the filters.
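For illustration, the workaround of pointing the path at the partition can be derived from the filters instead of being written by hand. The following is only a minimal sketch: `hive_partition_prefix` is a hypothetical helper, not part of pyarrow, and the values 2023, 1 and 15 are made up. It assumes that the directory names match the filter column names exactly (the question writes `client=` in the path but `client_id` in the filter, so the real key may differ) and that directory values are the plain string form of the filter values, with no zero padding.

```python
from typing import Any, Sequence, Tuple

Filter = Tuple[str, str, Any]


def hive_partition_prefix(
    base_path: str,
    filters: Sequence[Filter],
    partition_keys: Sequence[str],
) -> str:
    """Build base_path/key=value/... from equality filters, in partition order.

    Hypothetical helper: stops at the first partition key that has no
    equality filter, because deeper levels cannot be addressed without it.
    """
    # Only equality filters map to a single directory; ignore other operators.
    equal = {col: val for col, op, val in filters if op in ("=", "==")}
    parts = [base_path.rstrip("/")]
    for key in partition_keys:
        if key not in equal:
            break
        parts.append(f"{key}={equal[key]}")
    return "/".join(parts)


# Illustrative values only; the question leaves year, month and day unspecified.
filters = [
    ("client_id", "=", "client"),
    ("year", "=", 2023),
    ("month", "=", 1),
    ("day", "=", 15),
]
prefix = hive_partition_prefix(
    "s3_path", filters, ["client_id", "year", "month", "day"]
)
print(prefix)  # s3_path/client_id=client/year=2023/month=1/day=15
```

The resulting prefix could then be passed to `pq.ParquetDataset` in place of the dataset root, as in the second snippet of the question. One likely reason this is faster is that fewer objects have to be listed on S3 before any filter is applied, but that is an assumption that has not been verified against this particular dataset.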