Azure 数据资源管理器外部表分区有什么用处?
将权限添加到外部表定义对分区查询没有帮助。
Blob 路径示例
- /data/1234/2021/12/02/9483D.parquet
- /data/1235/2021/12/02/12345.parquet
分区(伪语法不是真正的语法): '/data/'uniqueid'/yyyy /MM/dd/'
因此存储路径中只有两个 uniqueid 值。路径中不同日期的文件总数约为 100 万个
所以我定义了 2 个分区作为虚拟列:
- uniqueid
- datetime
对 uniqueid 执行查询,如下所示: <代码>表|通过uniqueid总结 由于某种原因会检查 blob 存储中的所有文件。
由于 uniqueid 是一个分区并且作为虚拟列,因此查询不应该超级快吗,因为我们在它的路径中只有 2 个值? 我完全错过了分区的要点吗?
编辑添加样本:
.create external table ['sensordata'] (['timestamp']:long,['value']:real)
kind = adl
partition by (['uniqueid']:string ,['datecreated']:datetime )
pathformat = (['uniqueid'] '/' datetime_pattern("yyyy/MM/dd", ['daterecorded']))
dataformat = parquet
(
h@'abfss://[email protected]/histdata;impersonate'
)
with (FileExtension='.parquet')
查询样本:
sensordata
| summarize by uniqueid
Adding pertition to the external table definition does not help with a query on the partition.
Blob path example
- /data/1234/2021/12/02/9483D.parquet
- /data/1235/2021/12/02/12345.parquet
Partition (pseudo syntax not the real one) : '/data/'uniqueid'/yyyy/MM/dd/'
So only two uniqueids values are in the storage path. Total files count ~ 1 million for different dates in the path
So I defined 2 partitions as virtual columns:
- uniqueid
- datetime
Executing a query on the uniqueid like:table | summarize by uniqueid
goes over all files in the blob storage for some reason.
As the uniqueid is a partition and as virtual column, shouldn't the query be super fast as we have only 2 values in the path for it?
Am I totally missing the point of partitioning?
EDIT add smaple:
.create external table ['sensordata'] (['timestamp']:long,['value']:real)
kind = adl
partition by (['uniqueid']:string ,['datecreated']:datetime )
pathformat = (['uniqueid'] '/' datetime_pattern("yyyy/MM/dd", ['daterecorded']))
dataformat = parquet
(
h@'abfss://[email protected]/histdata;impersonate'
)
with (FileExtension='.parquet')
Query sample:
sensordata
| summarize by uniqueid
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
感谢您的输入,@user998888。
我们对分区外部表进行了许多优化,并且我们投入了大量精力来添加越来越多的优化。但我们仍然没有像您提供的那样优化查询类型。它在我们的名单上。
Thanks for your input, @user998888.
We have many optimizations for partitioned external tables, and we invest significant effort in adding more and more optimizations. But we still haven't optimized the type of query like the one you provided. It's on our list.