For multi-key partitions in Amazon Athena, does the order matter?
When setting up Amazon Athena partitions for use with partition projection and a Glue catalog, does the order of the partitions within the S3 bucket matter?
Example partition strategies:
- Partition by year/month/day.
s3://year=2022/month=01/day=21
- Partition by day/month/year.
s3://day=21/month=01/year=2022
Scenario 1: A query specifies year, month and day. Does one partition strategy execute the query faster? Does one partition strategy cost less? I imagine the data-scanned costs are the same, but what about the costs incurred from S3 operations?
Scenario 2: A query specifies only day. Does one partition strategy execute the query faster? Does one partition strategy cost less? Again, I imagine the data-scanned costs are the same, but what about the costs incurred from S3 operations?
Comments (1)
Remember that S3 has a flat structure: folders are an illusion, and there are only buckets and keys.
For a query where day = 21, the same keys (for example, 100 of them) need to be read whichever order is used.
For a query where year = 2022, the same number of reads (for example, 200) is needed either way.
I'm not 100% sure, but this is my reasoning.
Ref: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
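To make that reasoning concrete, here is a rough Python sketch. It is only a toy model of flat S3 keys, not how Athena actually plans a query: the date ranges, the `data.parquet` file name, and the helpers `build_keys` and `matching_keys` are all made up for illustration. It counts matching objects only and does not model how many S3 LIST or GET requests are issued.

```python
from itertools import product

# Hypothetical date ranges, chosen only for illustration.
YEARS = ["2021", "2022"]
MONTHS = [f"{m:02d}" for m in range(1, 13)]
DAYS = [f"{d:02d}" for d in range(1, 29)]  # 28 days keeps every month valid


def build_keys(order):
    """Return one flat object key per (year, month, day), with the
    name=value segments arranged in the given order."""
    keys = []
    for year, month, day in product(YEARS, MONTHS, DAYS):
        values = {"year": year, "month": month, "day": day}
        prefix = "/".join(f"{name}={values[name]}" for name in order)
        keys.append(f"{prefix}/data.parquet")  # hypothetical file name
    return keys


def matching_keys(keys, **filters):
    """Keep the keys that contain every requested name=value segment."""
    wanted = {f"{name}={value}" for name, value in filters.items()}
    return [key for key in keys if wanted <= set(key.split("/"))]


ymd_keys = build_keys(("year", "month", "day"))
dmy_keys = build_keys(("day", "month", "year"))

# Scenario 1: year, month and day are all specified.
full_ymd = matching_keys(ymd_keys, year="2022", month="01", day="21")
full_dmy = matching_keys(dmy_keys, year="2022", month="01", day="21")

# Scenario 2: only day is specified.
day_ymd = matching_keys(ymd_keys, day="21")
day_dmy = matching_keys(dmy_keys, day="21")

print(len(full_ymd), len(full_dmy))  # 1 1
print(len(day_ymd), len(day_dmy))    # 24 24
```

In this toy model both layouts select the same number of objects in each scenario, which matches the point that only the key strings differ, not the set of data behind them.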