Reclustering in Snowflake

Posted on 2025-02-08 04:18:10


I am working on Snowflake and need to apply clustering to a table on every run of an application. If the clustering information has changed, the statement changes the cluster key and also triggers reclustering. But what happens when the clustering information has not changed? That is, if the columns are the same as the current cluster key and we apply the cluster key again using an ALTER statement, would it still recluster?

E.g.

Consider tableA. I added a cluster key using alter table tableA cluster by (name).

Now, after some time, I reapply the same statement. Will it result in reclustering?


Comments (2)

欲拥i 2025-02-15 04:18:10


@Manish you seem to be confused about cluster keys. Let's assume you have a fact table where most of the queries look like this...

select ...
from big_table
where date_id between <Date Start> and <Date End>;

You might consider altering the table and creating a CLUSTER_KEY using:

alter table big_table 
cluster by date_id;

In the background, the automatic clustering service will cluster your table by DATE_ID.

There is no need to apply the cluster key again.
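If you want explicit control over when that background work happens, the service can be paused and resumed per table. A minimal sketch using the same table name:

alter table big_table suspend recluster;
-- ... run bulk loads or large updates here ...
alter table big_table resume recluster;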

You need to be careful however. Keep in mind the following advice from Snowflake:

  1. Only consider cluster keys on tables of 1 TB or more.
  2. Only cluster if you find that PARTITIONS_SCANNED is close to PARTITIONS_TOTAL, i.e. partitions are not currently being eliminated, which leads to poor query response times (see the sketch after this list).
  3. Ensure the cluster key appears as a predicate in the WHERE clause of queries.
  4. Be wary of placing cluster keys on tables where a significant proportion of the partitions are frequently updated. This may lead to a high cost of reclustering as updates can disrupt the clustering sequence.
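For point 2, one way to check is the QUERY_HISTORY view, which records PARTITIONS_SCANNED and PARTITIONS_TOTAL per query. A rough sketch, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share and substituting your own table name:

-- Recent queries against big_table that scanned nearly all of its
-- partitions, i.e. where partition elimination is not happening.
select query_id,
       partitions_scanned,
       partitions_total
from snowflake.account_usage.query_history
where query_text ilike '%big_table%'
  and partitions_total > 0
  and partitions_scanned >= 0.9 * partitions_total
order by start_time desc
limit 20;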

Check the existing clustering on the table using:

select system$clustering_information('big_table');

If the results you get look like this - your table is VERY well clustered:

select system$clustering_information('ORDERS_BY_DAY', '(O_ORDERDATE)');
{
"cluster_by_keys" : "LINEAR(O_ORDERDATE)",
"total_partition_count" : 6648,
"total_constant_partition_count" : 6648,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 6648,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}

If however your table looks like this, it is BADLY clustered, and you should consider creating a cluster key.

select system$clustering_information('snowflake_sample_data.tpcds_sf100tcl.web_sales','ws_web_page_sk');

{
"cluster_by_keys" : "LINEAR(ws_web_page_sk)",
"total_partition_count" : 300112,
"total_constant_partition_count" : 0,
"average_overlaps" : 300111.0,
"average_depth" : 300112.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"524288" : 300112
}
}

The key indicator you need to look for is the

"average_depth"

This shows the average number of partitions a query will scan for a lookup on a given value.

For example:

select ...
from big_table
where date_id = to_date('22-May-2022','DD-Mon-YYYY');

If you executed the above and it returned:

"average_depth" : 300112.0

This indicates that, on average, the above query will need around 300,000 partition reads to find the values. If however it says:

"average_depth" : 10

This indicates fewer than 10 partition reads, which on a large table (with over 300,000 partitions) is VERY well clustered.

Provided your "average_depth" is 10 or under, you're fine. However, keep in mind, we're assuming that most queries are limited by DATE_ID.

In conclusion: if you think you've identified a valid case for a cluster key, it should be created once and the costs then monitored.

You should also check that query performance improves for queries which hit the table and filter by the cluster key, in this case DATE_ID.
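For the cost side, the AUTOMATIC_CLUSTERING_HISTORY view shows the credits the clustering service consumes. A sketch, again assuming access to the SNOWFLAKE.ACCOUNT_USAGE share:

-- Daily credits spent by automatic clustering on big_table.
select to_date(start_time) as day,
       table_name,
       sum(credits_used) as credits_used
from snowflake.account_usage.automatic_clustering_history
where table_name = 'BIG_TABLE'
group by 1, 2
order by day desc;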

想你的星星会说话 2025-02-15 04:18:10


Thank you for the question.

So to rephrase: you mean to say there is a table A with a clustering key on the column "name", and now you add a different clustering key, say on the column "class".

If my understanding above is correct, then yes, it will definitely recluster. Think of it this way: the data is stored in micro-partitions and arranged based on the clustering key. If another clustering key is added, Snowflake will have to sort/re-arrange the data in the micro-partitions again based on the new clustering key.
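To illustrate with the table from the question (a sketch, not verified against a live account): changing the key rewrites the clustering metadata and triggers reclustering, while re-running the identical statement leaves the metadata unchanged, which is why the answer above says there is no need to apply it again.

alter table tableA cluster by (name);   -- initial key: data is reclustered by NAME
alter table tableA cluster by (name);   -- same key again: metadata unchanged, no new reclustering expected
alter table tableA cluster by (class);  -- different key: replaces (name) and triggers reclustering by CLASS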
