Why does Snowflake's CREATE TABLE AS SELECT (CTAS) ignore the ORDER BY clause?
The command is:
drop table if exists metrics_done;
create table metrics_done as select * from metrics where end_morning='2022-03-31' order by LOG_INFO desc;
The expected behaviour is creation of a table with sorted entries. But this does not happen. Why?
Comments (1)
Snowflake does apply the ORDER BY in a CTAS. You can see that with SYSTEM$CLUSTERING_INFORMATION, subject to some limitations around high-cardinality keys and around how the function reports clustering state before the automatic clustering service has run at least once with a new key.
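As a minimal sketch of that check, assuming the metrics_done table and the LOG_INFO column from the question, the clustering state could be inspected like this:

```sql
-- Assumes metrics_done was created by the CTAS shown in the question.
-- Returns a JSON string describing how well the table's micropartitions
-- are clustered on LOG_INFO (fields such as average_overlaps and average_depth).
select system$clustering_information('metrics_done', '(LOG_INFO)');
```

If the CTAS wrote the rows in LOG_INFO order, the reported overlap and depth would be expected to be low, within the limitations noted above. This describes how the data is stored, not the order in which a query returns it.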
However, the fact that Snowflake applies the ORDER BY in a CTAS does not mean a later query will return the rows in that order unless the query has its own ORDER BY clause. Snowflake is an MPP system and scans multiple micropartitions in parallel during a query. Without an ORDER BY, there is no reason for the optimizer to generate a plan that guarantees order. The plan it generates can and will return rows in the order they become ready for the result.
Here's an oversimplified example: in a CTAS you order by date, and all rows in micropartition 1 have date 2022-01-01 while all rows in micropartition 2 have date 2022-01-02. When you select rows from that table, the scan of micropartition 2 is just as likely to finish first as the scan of micropartition 1. If #2 finishes first, those rows will be first in the result set.
Also, when the table becomes large and has more micropartitions to scan than there are available CPUs in the warehouse, one or more CPUs will need to scan multiple micropartitions. In this case, there's no reason for Snowflake to prefer scanning one micropartition before another.
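In practice, this means the ordering has to be requested when the table is read. A minimal sketch, reusing the table and column names from the question:

```sql
-- Row order is only guaranteed when the query itself asks for it.
-- The ORDER BY in the CTAS affects how rows are laid out in micropartitions,
-- not the order in which a later SELECT returns them.
select *
from metrics_done
order by LOG_INFO desc;
```

The ORDER BY in the CTAS can still be worth keeping, since well-clustered data may help pruning on LOG_INFO, but it is not a substitute for ordering in the query.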