Snowflake: aggregating over a sliding window (past 60 minutes) for a dataset with non-uniform sampling frequency
I have non-uniformly sampled data. I want to aggregate the data on a rolling/sliding basis (the past 60 minutes).
To get an hourly average (partitioned by city), I used the following code, which worked.
SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY, DATE_AND_HOUR ORDER BY TIMESTAMP)
FROM
(
    SELECT *,
           date_trunc('HOUR', TIMESTAMP) AS DATE_AND_HOUR
    FROM SAMPLE_DATA
)
However, my desired output is as follows:
I know Snowflake doesn't support RANGE, and I can't specify ROWS BETWEEN in a window function because my sampling is non-uniform.
I read some potential solutions on this page, but they don't work in Snowflake: sum last n days quantity using sql window function
Essentially, it's an analogous problem.
Comments (2)
Firstly, what you show as "average" in your example is the "sum", and your first "Shanghai" result includes a "Beijing" result.
You have two options: build a fixed-size window dataset (build partials for each minute) and then use a window frame of fixed size over that, or self-join and just aggregate those (as Felipe has shown).
If you have very dense data, you might find the former more performant; if you have sparse data, the latter approach should be faster, and it is definitely faster to code.
So the simple first:
gives:
The dense version:
gives:
And those "extra" rows could be dropped with a QUALIFY clause.
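The code for the dense version did not survive on this page, so here is a rough sketch of what that approach could look like, not the answerer's original query. It assumes the SAMPLE_DATA table and the CITY, TIMESTAMP and VALUE columns from the question; the CTE names, the output aliases and the 100000-minute cap on the generated range are made up for illustration.

```sql
-- Sketch only: per-minute partials plus a fixed 60-row window frame.
-- bounds, numbers, spine, partials and all aliases are hypothetical names.
WITH bounds AS (
    -- first and last sampled minute per city
    SELECT CITY,
           DATE_TRUNC('minute', MIN(TIMESTAMP)) AS FIRST_MIN,
           DATE_TRUNC('minute', MAX(TIMESTAMP)) AS LAST_MIN
    FROM SAMPLE_DATA
    GROUP BY CITY
),
numbers AS (
    -- gap-free 0, 1, 2, ...; ROWCOUNT must be a constant, so 100000 minutes
    -- is an assumed upper bound on the time span of the data
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS N
    FROM TABLE(GENERATOR(ROWCOUNT => 100000))
),
spine AS (
    -- one row per city per minute, with or without samples
    SELECT b.CITY,
           DATEADD('minute', n.N, b.FIRST_MIN) AS MINUTE_TS
    FROM bounds b
    JOIN numbers n
      ON n.N <= DATEDIFF('minute', b.FIRST_MIN, b.LAST_MIN)
),
partials AS (
    -- partial sum and count for each minute that has samples
    SELECT CITY,
           DATE_TRUNC('minute', TIMESTAMP) AS MINUTE_TS,
           SUM(VALUE)   AS SUM_VALUE,
           COUNT(VALUE) AS CNT_VALUE
    FROM SAMPLE_DATA
    GROUP BY 1, 2
)
SELECT s.CITY,
       s.MINUTE_TS,
       -- current minute plus the 59 minutes before it
       SUM(p.SUM_VALUE) OVER (PARTITION BY s.CITY ORDER BY s.MINUTE_TS
                              ROWS BETWEEN 59 PRECEDING AND CURRENT ROW)
         / NULLIF(SUM(p.CNT_VALUE) OVER (PARTITION BY s.CITY ORDER BY s.MINUTE_TS
                                         ROWS BETWEEN 59 PRECEDING AND CURRENT ROW), 0)
         AS AVG_LAST_60_MIN
FROM spine s
LEFT JOIN partials p
  ON p.CITY = s.CITY
 AND p.MINUTE_TS = s.MINUTE_TS
-- drop the "extra" spine rows (minutes without any sample)
QUALIFY p.CNT_VALUE IS NOT NULL
ORDER BY s.CITY, s.MINUTE_TS;
```

Because every minute becomes exactly one row, ROWS BETWEEN 59 PRECEDING AND CURRENT ROW always covers 60 minute buckets, which is what makes a row-based frame usable on unevenly sampled data. The result has minute granularity (one row per city and sampled minute), so it approximates an exact 60-minute lookback from each original timestamp.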
You can solve this with a self-join:
Just change 'hour' to 'minute' if you want the last x minutes.
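The query from this answer is missing on this page. A minimal sketch of a self-join along these lines, assuming the SAMPLE_DATA table and the CITY, TIMESTAMP and VALUE columns from the question (the aliases a, b and AVG_LAST_60_MIN are invented here, and this is not necessarily the original code), might be:

```sql
-- Sketch only: for each row a, average every row b of the same city
-- whose timestamp falls in the hour ending at a's timestamp.
SELECT a.CITY,
       a.TIMESTAMP,
       a.VALUE,
       AVG(b.VALUE) AS AVG_LAST_60_MIN
FROM SAMPLE_DATA a
JOIN SAMPLE_DATA b
  ON b.CITY = a.CITY
 AND b.TIMESTAMP >  DATEADD('hour', -1, a.TIMESTAMP)  -- window start, exclusive
 AND b.TIMESTAMP <= a.TIMESTAMP                       -- window end, inclusive
GROUP BY a.CITY, a.TIMESTAMP, a.VALUE
ORDER BY a.CITY, a.TIMESTAMP;
```

Each row always matches itself, so no row is lost by the inner join. Rows that are identical in CITY, TIMESTAMP and VALUE collapse into one output row under this GROUP BY; if that matters, grouping by a unique row key would avoid it. For a different window, something like DATEADD('minute', -30, a.TIMESTAMP) should work the same way.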