Reclustering in Snowflake
I am working with Snowflake and need to apply clustering to a table on every run of my application. If the clustering information changes, the cluster keys change and reclustering is triggered. But what happens when the clustering information has not changed? That is, if the columns are the same as the current cluster keys and we set the cluster keys again using an ALTER statement, will the table still be reclustered?
For example:
Consider tableA, to which I added a cluster key using alter table tableA cluster by (name).
If I reapply the same statement some time later, will it result in reclustering?
Comments (2)
@Manish, you seem to be confused about cluster keys. Let's assume you have a fact table where most of the queries look like this:
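As an illustration only, a query of this shape might look like the sketch below. The table sales_fact and the columns date_id and amount are hypothetical names, not taken from the question.

```sql
-- Hypothetical fact table and columns, used only for illustration.
-- The important part is that the query filters on DATE_ID.
SELECT date_id,
       SUM(amount) AS total_amount
FROM   sales_fact
WHERE  date_id BETWEEN 20240101 AND 20240131
GROUP  BY date_id;
```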
You might consider altering the table and creating a cluster key using:
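A minimal sketch of such a statement, assuming a hypothetical fact table named sales_fact with a DATE_ID column:

```sql
-- Define DATE_ID as the clustering key of the (hypothetical) table.
-- This is meant to be run once, not on every application run.
ALTER TABLE sales_fact CLUSTER BY (date_id);
```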
In the background, the Automatic Clustering service will cluster your table by DATE_ID.
There is no need to apply the cluster key again.
You need to be careful, however. Keep in mind the following advice from Snowflake:
Check the existing clustering on the table using:
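One way to run this check is the SYSTEM$CLUSTERING_INFORMATION function. The sketch below assumes a hypothetical table sales_fact and passes the column list explicitly, so it also works on a table that has no cluster key defined yet.

```sql
-- Returns a JSON string describing how well the table is clustered
-- on the given column(s); sales_fact and date_id are hypothetical names.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(date_id)');
```

The JSON result includes fields such as total_partition_count, average_overlaps, average_depth, and partition_depth_histogram.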
If the results you get look like this, your table is VERY well clustered:
If, however, your table looks like this, it is BADLY clustered, and you should consider creating a cluster key.
The key indicator you need to look for is "average_depth". This shows, roughly, the average number of overlapping micro-partitions a query will have to scan for a lookup on a given value.
For example:
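One possible form of such a check is SYSTEM$CLUSTERING_DEPTH, which returns only the average depth for the given columns. The table sales_fact and the column date_id are hypothetical names used for illustration.

```sql
-- Returns a single number: the average clustering depth of the table
-- for the listed column(s). Lower is better; 1 is the minimum.
SELECT SYSTEM$CLUSTERING_DEPTH('sales_fact', '(date_id)');
```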
If you executed the above and it returned:
This indicates that, on average, the above query will need to read 300,000 partitions to find the values. If, however, it says:
This indicates fewer than 10 partition reads, which on a large table (with over 300,000 partitions) is VERY well clustered.
Provided your "average_depth" is 10 or under, you're fine. However, keep in mind, we're assuming that most queries are limited by DATE_ID.
In conclusion, if you think you've identified a valid case for a cluster key, it should be created once and the costs then monitored.
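As a sketch of how the cost could be monitored, the AUTOMATIC_CLUSTERING_HISTORY table function reports the credits Automatic Clustering has consumed. The table name SALES_FACT and the 7-day window are assumptions made for illustration.

```sql
-- Credits used by Automatic Clustering for one table over the last 7 days.
-- SALES_FACT is a hypothetical table name.
SELECT start_time,
       end_time,
       credits_used,
       num_rows_reclustered
FROM   TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
           DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
           TABLE_NAME       => 'SALES_FACT'))
ORDER  BY start_time;
```

If the cost turns out to be higher than the benefit, background reclustering can be paused with ALTER TABLE ... SUSPEND RECLUSTER and resumed later with RESUME RECLUSTER.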
You should also check that query performance has improved for queries which hit the table and filter by the cluster key, in this case DATE_ID.
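One way to check this, sketched under the assumption that you have access to the SNOWFLAKE.ACCOUNT_USAGE share, is to compare the partitions scanned with the partitions available for recent queries against the table. The filter on the hypothetical name sales_fact is a simple text match and may need adjusting.

```sql
-- For well-pruned queries, partitions_scanned should be a small
-- fraction of partitions_total. ACCOUNT_USAGE views have some latency,
-- so very recent queries may not appear yet.
SELECT query_id,
       start_time,
       total_elapsed_time,
       partitions_scanned,
       partitions_total
FROM   SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE  query_text ILIKE '%sales_fact%'
  AND  start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER  BY start_time DESC;
```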
Thank you for the question.
So, to rephrase: you mean there is a table A with a clustering key on the column "name", and you now add another clustering key, say on the column "class".
If my understanding above is correct, it will definitely recluster. The data is stored in micro-partitions and arranged according to the clustering key. If another clustering key is added, the data has to be sorted and rearranged in the micro-partitions again based on the new clustering keys.
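A minimal sketch of the scenario described here, assuming that "adding another clustering key" means extending the key from (name) to (name, class). A table has a single clustering key, which may span several columns, so the ALTER statement replaces the previous definition.

```sql
-- Original definition from the question.
ALTER TABLE tableA CLUSTER BY (name);

-- Changed definition: the key now covers two columns, so the data
-- has to be rearranged to match the new key.
ALTER TABLE tableA CLUSTER BY (name, class);

-- The cluster_by column of the output shows the key currently in effect.
SHOW TABLES LIKE 'tableA';
```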