Snowflake: aggregate a dataset over a sliding window (the past 60 minutes) when sampling is non-uniform

Posted on 2025-02-10 03:09:10

I have data with a non-uniform sampling distribution. I want to aggregate the data on a rolling/sliding basis (the past 60 minutes).


In order to achieve an hourly average (partitioned by city), I used the following code, which worked:

SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY, DATE_AND_HOUR ORDER BY TIMESTAMP) AS HOURLY_AVG
FROM
(
    SELECT *,
           DATE_TRUNC('HOUR', TIMESTAMP) AS DATE_AND_HOUR
    FROM SAMPLE_DATA
)

However, my desired output is as follows:


I know Snowflake doesn't support RANGE in window frames, and I can't specify ROWS BETWEEN in a window function because my sampling distribution is non-uniform.
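
For illustration, a row-based frame is something Snowflake will accept, but it counts rows rather than minutes, so with non-uniform sampling it averages the last 60 readings instead of the readings from the last 60 minutes (a sketch against the same SAMPLE_DATA table; the alias NOT_THE_ANSWER is mine):

SELECT *,
       AVG(VALUE) OVER (PARTITION BY CITY ORDER BY TIMESTAMP
                        ROWS BETWEEN 60 PRECEDING AND CURRENT ROW) AS NOT_THE_ANSWER  -- a 61-row frame, not a 60-minute frame
FROM SAMPLE_DATA;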

I read some potential solutions on this page, but they don't work in Snowflake: sum last n days quantity using sql window function
Essentially, it's an analogous problem.

Comments (2)

拿命拼未来 2025-02-17 03:09:11

Firstly, what you show as "average" in your example is the "sum", and your first "Shanghai" result includes a "Beijing" result.

You have two options: build a fixed-size window dataset (build partials for each minute) and then use a fixed-size window frame over that, or self-join and just aggregate those rows (as Felipe has shown).

If you have very dense data, you might find the former more performant; if your data is sparse, the latter approach should be faster, and it is definitely faster to code.

So the simple first:

with data(city, timestamp, value) as (
    select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
    ('beijing', '2022/05/25 10:33', 22),
    ('beijing', '2022/05/25 10:37', 20),
    ('beijing', '2022/05/25 11:36', 29),
    ('beijing', '2022/05/26 11:36', 28),
    ('beijing', '2022/05/26 10:00', 21),
    ('shanghai', '2022/05/26 11:00', 33),
    ('shanghai', '2022/05/26 11:46', 35),
    ('shanghai', '2022/05/26 12:40', 37)
)
select a.*
    ,avg(b.value) as p60_avg
    ,count(b.value)-1 as p60_count  -- -1 so the current row is not counted
    ,sum(b.value) as p60_sum
from data as a
left join data as b  -- pair each row with all same-city rows in the prior 60 minutes (inclusive)
    on a.city = b.city and b.timestamp between dateadd(hour, -1, a.timestamp) and a.timestamp
group by 1,2,3
order by 1,2

gives:

CITY      TIMESTAMP                VALUE  P60_AVG  P60_COUNT  P60_SUM
beijing   2022-05-25 10:33:00.000  22     22       0          22
beijing   2022-05-25 10:37:00.000  20     21       1          42
beijing   2022-05-25 11:36:00.000  29     24.5     1          49
beijing   2022-05-26 10:00:00.000  21     21       0          21
beijing   2022-05-26 11:36:00.000  28     28       0          28
shanghai  2022-05-26 11:00:00.000  33     33       0          33
shanghai  2022-05-26 11:46:00.000  35     34       1          68
shanghai  2022-05-26 12:40:00.000  37     36       1          72
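
To read one row: for the shanghai row at 11:46, the one-hour window also catches the 11:00 reading, so p60_sum = 33 + 35 = 68, p60_avg = 68 / 2 = 34, and p60_count is 1 because the -1 excludes the current row from the count.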

The dense version:

with data(city, timestamp, value) as (
    select column1, try_to_timestamp(column2, 'yyyy/mm/dd hh:mi'), column3 from values 
    ('beijing', '2022/05/25 10:33', 22),
    ('beijing', '2022/05/25 10:37', 20),
    ('beijing', '2022/05/25 11:36', 29),
    ('beijing', '2022/05/26 11:36', 28),
    ('beijing', '2022/05/26 10:00', 21),
    ('shanghai', '2022/05/26 11:00', 33),
    ('shanghai', '2022/05/26 11:46', 35),
    ('shanghai', '2022/05/26 12:40', 37)
), filled_time as (
    -- build one row per minute per city, from each city's first reading to its last;
    -- the generator must produce at least as many rows as the widest span in minutes
    select city,
        dateadd(minute, row_number() over(partition by city order by null)-1, min_t) as timestamp
    from (
        select 
            city, min(timestamp) as min_t, max(timestamp) as max_t
        from data
        group by 1
    ), table(generator(ROWCOUNT => 10000))
    qualify timestamp <= max_t
)
select
    ft.city
    ,ft.timestamp
    -- rows are now exactly one minute apart, so a fixed 60-row frame is a 60-minute frame;
    -- partitioning by city keeps the frame from crossing city boundaries
    ,avg(d.value) over (partition by ft.city order by ft.timestamp ROWS BETWEEN 60 PRECEDING AND current row ) as p60_avg
from filled_time as ft
left join data as d
    on ft.city = d.city and ft.timestamp = d.timestamp
order by 1,2;

gives:

CITY     TIMESTAMP                P60_AVG
beijing  2022-05-25 10:33:00.000  22
beijing  2022-05-25 10:34:00.000  22
beijing  2022-05-25 10:35:00.000  22
beijing  2022-05-25 10:36:00.000  22
beijing  2022-05-25 10:37:00.000  21
beijing  2022-05-25 10:38:00.000  21
beijing  2022-05-25 10:39:00.000  21
beijing  2022-05-25 10:40:00.000  21
beijing  2022-05-25 10:41:00.000  21
beijing  2022-05-25 10:42:00.000  21
beijing  2022-05-25 10:43:00.000  21
beijing  2022-05-25 10:44:00.000  21
beijing  2022-05-25 10:45:00.000  21
beijing  2022-05-25 10:46:00.000  21
snip...

And those "extra" rows could be dumped with a qualify

select
    ft.city
    ,ft.timestamp
    ,avg(d.value) over (partition by ft.city order by ft.timestamp ROWS BETWEEN 60 PRECEDING AND current row ) as p60_avg
    --,count(d.value)-1 as p60_count
    --,sum(d.value) as p60_sum
from filled_time as ft
left join data as d
    on ft.city = d.city and ft.timestamp = d.timestamp
qualify d.value is not null  -- keep only the minutes that have a real reading
order by 1,2;

勿忘初心 2025-02-17 03:09:10

You can solve this with a self-join:

with data as (
    select *
    from temp_fh_wikipedia.public.wikipedia_2020
    where title in ('San_Francisco', 'Los_Angeles')
    and wiki='en'
    and datehour > '2020-10-13'
)

select a.title, a.datehour, a.views, avg(b.views) avg_previous_5h
from data a
join data b  -- every same-title row within the preceding 5 hours (inclusive)
on a.title=b.title
and b.datehour between timestampadd(hour, -5, a.datehour) and a.datehour

group by 1, 2, 3
order by 1, 2
limit 100

Just change 'hour' to 'minute' if you want the last x minutes.
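
For instance, a last-60-minutes version of the same query would look like this (a sketch reusing the data CTE above; the alias avg_previous_60m is mine):

select a.title, a.datehour, a.views, avg(b.views) avg_previous_60m
from data a
join data b
on a.title=b.title
and b.datehour between timestampadd(minute, -60, a.datehour) and a.datehour
group by 1, 2, 3
order by 1, 2
limit 100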

