Optimizing fact read & write in Snowflake

Posted 2025-02-08 16:10:49

All,

I am looking for some tips on optimizing an ELT on a Snowflake fact table with approx. 15 billion rows.

We get approximately 30,000 rows every 35 minutes, like the ones below, and we will always get 3 Account Dimension Key values, i.e. SALES, COGS & MKT.

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    Y
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y

This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT  IsCurrent
001          2019-01-01  001                   0012          100    300   200  Y
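
For illustration, a minimal sketch of how such a pivot can be written in Snowflake using conditional aggregation; FACT_FINANCE and FACT_FINANCE_PIVOT are hypothetical names for the source fact and the pivoted reporting table, not objects named in the question.

-- Sketch only: collapse the three account rows per key combination into one pivoted row.
-- FACT_FINANCE and FACT_FINANCE_PIVOT are hypothetical table names.
INSERT INTO FACT_FINANCE_PIVOT
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT, IsCurrent)
SELECT
    Finance_Key,
    Date_Key,
    Department_Group_Key,
    Scenario_Key,
    MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
    MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
    MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT,
    'Y' AS IsCurrent
FROM FACT_FINANCE
WHERE IsCurrent = 'Y'
GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key;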

At times an adjustment is made, and only for 1 Account Key, like the one below.

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value
001          2019-01-01  001                   0012          SALES_001    50

Hence we have to do this

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    X
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
001          2019-01-01  001                   0012          SALES_001    50     Y
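
A sketch of one way that IsCurrent maintenance can be done, assuming a hypothetical staging table STG_FINANCE holding the incoming adjustment rows and the same hypothetical FACT_FINANCE name as above:

-- Sketch only: mark the previously current row for each adjusted key/account as superseded,
-- then insert the adjustment rows as the new current versions.
UPDATE FACT_FINANCE
SET    IsCurrent = 'X'
FROM   STG_FINANCE
WHERE  FACT_FINANCE.Finance_Key          = STG_FINANCE.Finance_Key
  AND  FACT_FINANCE.Date_Key             = STG_FINANCE.Date_Key
  AND  FACT_FINANCE.Department_Group_Key = STG_FINANCE.Department_Group_Key
  AND  FACT_FINANCE.Scenario_Key         = STG_FINANCE.Scenario_Key
  AND  FACT_FINANCE.Account_Key          = STG_FINANCE.Account_Key
  AND  FACT_FINANCE.IsCurrent            = 'Y';

INSERT INTO FACT_FINANCE
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, Account_Key, Value, IsCurrent)
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, Account_Key, Value, 'Y'
FROM   STG_FINANCE;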

And the resulting value should be

Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT
001          2019-01-01  001                   0012          50     300   200

However, my question is how to go about optimizing the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.

This is more of an optimization of the read and the write.

Any pointers?

Thanks

Comments (1)

烙印 2025-02-15 16:10:49

So a longer form answer:

I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. The Snowflake Query Acceleration Service might help with the IO.
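
If that route is tried, Query Acceleration is a warehouse-level setting; a minimal sketch, with a hypothetical warehouse name:

-- Sketch only: LOAD_WH is a hypothetical warehouse name.
ALTER WAREHOUSE LOAD_WH SET ENABLE_QUERY_ACCELERATION = TRUE;
ALTER WAREHOUSE LOAD_WH SET QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;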

But the primary way to improve total processing time is to process less.

For example, in my old job we had IoT data that could include messages up to two weeks old, and as part of our pipeline error handling we de-duplicated all messages on load (which is effectively a similar process). We found that handling batches with a minimum date of -2 weeks against the full message tables (which also had billions of rows) spent most of the time reading/writing those tables. By altering the ingestion to sideline messages older than a couple of hours, and deferring their processing until "midnight", we could get the processing of all the timely points in every batch done in under 5 minutes (we did a large amount of other processing in that interval), and we could turn the warehouse cluster size down outside of core data hours and use the saved credits to run a bigger instance for the midnight "catch-up" that brought the whole day's worth of sidelined data on board. This eventually-consistent approach worked very well for our application.
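
A sketch of that sidelining split using a Snowflake conditional multi-table insert; the table names and the event_ts column are hypothetical and only illustrate the pattern, not the original pipeline:

-- Sketch only: all names here are hypothetical; the pattern routes fresh rows to the
-- timely path and defers older rows to a sideline table for a nightly catch-up job.
INSERT ALL
    WHEN event_ts >= DATEADD(hour, -2, CURRENT_TIMESTAMP()) THEN
        INTO TIMELY_BATCH
    ELSE
        INTO SIDELINED_BATCH
SELECT *
FROM LANDING_BATCH;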

So if your destination table is well clustered, such that the reads/writes touch just a fraction of the data, then this should be less of a problem for you. But given that you are asking the question, I assume this is not the case.
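
To make the "fraction of the data" point concrete, a sketch of the kind of upsert where clustering pays off: the MERGE joins only on the key combinations present in the small 35-minute batch (hypothetical STG_FINANCE), so a well-clustered pivoted table lets Snowflake prune most micro-partitions rather than rewrite all 15B rows. The COALESCEs keep existing values when an adjustment batch carries only one account key.

-- Sketch only: STG_FINANCE and FACT_FINANCE_PIVOT are hypothetical names.
MERGE INTO FACT_FINANCE_PIVOT t
USING (
    -- Re-pivot just the incoming batch, not the 15B-row history.
    SELECT
        Finance_Key,
        Date_Key,
        Department_Group_Key,
        Scenario_Key,
        MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
        MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
        MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT
    FROM STG_FINANCE
    GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key
) s
ON  t.Finance_Key          = s.Finance_Key
AND t.Date_Key             = s.Date_Key
AND t.Department_Group_Key = s.Department_Group_Key
AND t.Scenario_Key         = s.Scenario_Key
WHEN MATCHED THEN UPDATE SET
    SALES = COALESCE(s.SALES, t.SALES),
    COGS  = COALESCE(s.COGS,  t.COGS),
    MKT   = COALESCE(s.MKT,   t.MKT)
WHEN NOT MATCHED THEN INSERT
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT, IsCurrent)
VALUES
    (s.Finance_Key, s.Date_Key, s.Department_Group_Key, s.Scenario_Key, s.SALES, s.COGS, s.MKT, 'Y');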

So if the table's natural clustering is unaligned with the data being loaded, is that because the destination table needs to be a different shape for the critical reads it is serving? At that point you have to decide which is more cost/time sensitive. Another option is to cluster the destination table by date (again assuming the fresh data covers a small window of time) and have a materialized view/table on top of that table with a different sort order, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow faster upsert times and faster read performance, assuming both are time sensitive.
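
A sketch of that option with hypothetical names; note that materialized views are an Enterprise Edition feature and their background maintenance consumes credits:

-- Sketch only: FACT_FINANCE_PIVOT and the view name are hypothetical.
-- Cluster the destination to match the narrow date window of fresh data,
-- and let Snowflake maintain a differently clustered copy for the read path.
ALTER TABLE FACT_FINANCE_PIVOT CLUSTER BY (Date_Key);

CREATE MATERIALIZED VIEW FACT_FINANCE_PIVOT_BY_FINANCE
    CLUSTER BY (Finance_Key)
AS
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT
FROM FACT_FINANCE_PIVOT;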
