Optimizing fact read & write in Snowflake

Posted 2025-02-08 16:10:49

All,

I am looking for some tips on optimizing an ELT on a Snowflake fact table with approx. 15 billion rows.

We get approximately 30,000 rows every 35 minutes, like the ones below, and we will always get 3 Account Dimension Key values, i.e. SALES, COGS & MKT.

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    Y
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y

This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT  IsCurrent
001          2019-01-01  001                   0012          100    300   200  Y
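
For illustration, a minimal sketch of how such a pivot can be written in Snowflake using conditional aggregation; FACT_FINANCE and FACT_FINANCE_PIVOT are hypothetical names for the source fact and the pivoted reporting table, not objects named in the question.

-- Sketch only: collapse the three account rows per key combination into one pivoted row.
-- FACT_FINANCE and FACT_FINANCE_PIVOT are hypothetical table names.
INSERT INTO FACT_FINANCE_PIVOT
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT, IsCurrent)
SELECT
    Finance_Key,
    Date_Key,
    Department_Group_Key,
    Scenario_Key,
    MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
    MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
    MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT,
    'Y' AS IsCurrent
FROM FACT_FINANCE
WHERE IsCurrent = 'Y'
GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key;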

At times an adjustment is made, and only for 1 Account Key, like the one below.

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value
001          2019-01-01  001                   0012          SALES_001    50

Hence we have to do this

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    X
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
001          2019-01-01  001                   0012          SALES_001    50     Y
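
A sketch of one way that IsCurrent maintenance can be done, assuming a hypothetical staging table STG_FINANCE holding the incoming adjustment rows and the same hypothetical FACT_FINANCE name as above:

-- Sketch only: mark the previously current row for each adjusted key/account as superseded,
-- then insert the adjustment rows as the new current versions.
UPDATE FACT_FINANCE
SET    IsCurrent = 'X'
FROM   STG_FINANCE
WHERE  FACT_FINANCE.Finance_Key          = STG_FINANCE.Finance_Key
  AND  FACT_FINANCE.Date_Key             = STG_FINANCE.Date_Key
  AND  FACT_FINANCE.Department_Group_Key = STG_FINANCE.Department_Group_Key
  AND  FACT_FINANCE.Scenario_Key         = STG_FINANCE.Scenario_Key
  AND  FACT_FINANCE.Account_Key          = STG_FINANCE.Account_Key
  AND  FACT_FINANCE.IsCurrent            = 'Y';

INSERT INTO FACT_FINANCE
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, Account_Key, Value, IsCurrent)
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, Account_Key, Value, 'Y'
FROM   STG_FINANCE;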

And the resulting value should be

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT
001          2019-01-01  001                   0012          50     300   200

However, my question is how to go about optimizing the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.

This is more of an optimization of the read and the write.

Any pointers?

Thanks

Comments (1)

烙印 2025-02-15 16:10:49

So a longer form answer:

I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. The Snowflake Query Acceleration Service might help with the IO.
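
If that route is tried, Query Acceleration is a warehouse-level setting; a minimal sketch, with a hypothetical warehouse name:

-- Sketch only: LOAD_WH is a hypothetical warehouse name.
ALTER WAREHOUSE LOAD_WH SET ENABLE_QUERY_ACCELERATION = TRUE;
ALTER WAREHOUSE LOAD_WH SET QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;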

But the primary way to improve total processing time is to process less.

For example, in my old job we had IoT data that could include messages up to two weeks old, and as part of our pipeline error handling we de-duplicated all messages on load (which is effectively a similar process). We found that handling batches with a minimum date of -2 weeks against the full message tables (which also had billions of rows) spent most of the time reading/writing those tables. By altering the ingestion to sideline messages older than a couple of hours, and deferring their processing until "midnight", we could get the processing of all the timely points in every batch done in under 5 minutes (we did a large amount of other processing in that interval), and we could turn the warehouse cluster size down outside of core data hours and use the saved credits to run a bigger instance for the midnight "catch-up" that brought the whole day's worth of sidelined data on board. This eventually-consistent approach worked very well for our application.
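
A sketch of that sidelining split using a Snowflake conditional multi-table insert; the table names and the event_ts column are hypothetical and only illustrate the pattern, not the original pipeline:

-- Sketch only: all names here are hypothetical; the pattern routes fresh rows to the
-- timely path and defers older rows to a sideline table for a nightly catch-up job.
INSERT ALL
    WHEN event_ts >= DATEADD(hour, -2, CURRENT_TIMESTAMP()) THEN
        INTO TIMELY_BATCH
    ELSE
        INTO SIDELINED_BATCH
SELECT *
FROM LANDING_BATCH;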

So if your destination table is well clustered, such that the reads/writes touch just a fraction of the data, then this should be less of a problem for you. But given that you are asking the question, I assume this is not the case.
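
To make the "fraction of the data" point concrete, a sketch of the kind of upsert where clustering pays off: the MERGE joins only on the key combinations present in the small 35-minute batch (hypothetical STG_FINANCE), so a well-clustered pivoted table lets Snowflake prune most micro-partitions rather than rewrite all 15B rows. The COALESCEs keep existing values when an adjustment batch carries only one account key.

-- Sketch only: STG_FINANCE and FACT_FINANCE_PIVOT are hypothetical names.
MERGE INTO FACT_FINANCE_PIVOT t
USING (
    -- Re-pivot just the incoming batch, not the 15B-row history.
    SELECT
        Finance_Key,
        Date_Key,
        Department_Group_Key,
        Scenario_Key,
        MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
        MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
        MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT
    FROM STG_FINANCE
    GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key
) s
ON  t.Finance_Key          = s.Finance_Key
AND t.Date_Key             = s.Date_Key
AND t.Department_Group_Key = s.Department_Group_Key
AND t.Scenario_Key         = s.Scenario_Key
WHEN MATCHED THEN UPDATE SET
    SALES = COALESCE(s.SALES, t.SALES),
    COGS  = COALESCE(s.COGS,  t.COGS),
    MKT   = COALESCE(s.MKT,   t.MKT)
WHEN NOT MATCHED THEN INSERT
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT, IsCurrent)
VALUES
    (s.Finance_Key, s.Date_Key, s.Department_Group_Key, s.Scenario_Key, s.SALES, s.COGS, s.MKT, 'Y');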

So if the table's natural clustering is unaligned with the data being loaded, is that because the destination table needs to be a different shape for the critical reads it is serving? At that point you have to decide which is more cost/time sensitive. Another option is to cluster the destination table by date (again assuming the fresh data covers a small window of time) and have a materialized view/table on top of that table with a different sort order, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow faster upsert times and faster read performance, assuming both are time sensitive.
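
A sketch of that option with hypothetical names; note that materialized views are an Enterprise Edition feature and their background maintenance consumes credits:

-- Sketch only: FACT_FINANCE_PIVOT and the view name are hypothetical.
-- Cluster the destination to match the narrow date window of fresh data,
-- and let Snowflake maintain a differently clustered copy for the read path.
ALTER TABLE FACT_FINANCE_PIVOT CLUSTER BY (Date_Key);

CREATE MATERIALIZED VIEW FACT_FINANCE_PIVOT_BY_FINANCE
    CLUSTER BY (Finance_Key)
AS
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT
FROM FACT_FINANCE_PIVOT;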
