Why does Snowflake ignore the ORDER BY clause in CREATE TABLE AS SELECT (CTAS)?

Posted on 2025-02-05 07:14:17

The command is:

drop table if exists metrics_done;
create table metrics_done as select * from metrics where end_morning='2022-03-31' order by LOG_INFO desc;

The expected behaviour is creation of a table with sorted entries. But this does not happen. Why?

Comments (1)

痴情 2025-02-12 07:14:17

Snowflake does honor ORDER BY in a CTAS. You can see this with SYSTEM$CLUSTERING_INFORMATION, subject to some limitations around high cardinality and around how the function checks clustering state before the automatic clustering service has run at least once with a new key.
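One way to check this is sketched below. It assumes the table and column names from the question (metrics_done, LOG_INFO) and passes the column list explicitly, since the table has no clustering key defined. Treat it as an illustration rather than a definitive diagnostic.

```sql
-- Ask Snowflake how well metrics_done is clustered on the CTAS sort column.
-- The result is a JSON string; fields such as average_overlaps and
-- average_depth describe how much the micropartitions overlap on LOG_INFO.
SELECT SYSTEM$CLUSTERING_INFORMATION('metrics_done', '(LOG_INFO)');
```

Low overlap and depth values would suggest the rows were written in sorted order, keeping in mind the limitations mentioned above.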

However, the fact that Snowflake applies the ORDER BY when writing the CTAS does not mean rows will come back in that order from a later query that has no ORDER BY clause. Snowflake is an MPP system and scans multiple micropartitions in parallel during a query. Without an ORDER BY, the optimizer has no reason to generate a plan that guarantees order. The plan it generates can and will return rows in whatever order they become ready for the result.
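In practice this means the ordering has to be requested at read time. A minimal sketch, reusing the table and column names from the question:

```sql
-- No ordering guarantee: rows arrive in whatever order the scans finish.
SELECT * FROM metrics_done;

-- Guaranteed ordering: state the ORDER BY on the query that reads the table.
SELECT * FROM metrics_done ORDER BY LOG_INFO DESC;
```

The ORDER BY in the CTAS can still be worth keeping, since it influences how rows are laid out across micropartitions, but it does not replace the ORDER BY on the reading query.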

Here's an oversimplified example: in a CTAS you order by date, so all rows in micropartition 1 have date 2022-01-01 and all rows in micropartition 2 have date 2022-01-02. When you select rows from that table, the scan of micropartition 2 is just as likely to finish first as the scan of micropartition 1. If #2 finishes first, its rows will come first in the result set.

Also, once the table grows large enough that there are more micropartitions to scan than CPUs available in the warehouse, one or more CPUs will have to scan several micropartitions each. In that case there is no reason for a CPU to scan one micropartition before another.
