Redshift: aggregating grouped data over date ranges

Posted 2025-01-13 02:48:26 · 1867 words · 4 views · 0 comments

I have the following table, which contains quantities of items per day.

ID   Date        Item   Count
-----------------------------
1    2022-01-01  Milk   10
2    2022-01-11  Milk   20
3    2022-01-12  Milk   10
4    2022-01-15  Milk   12
5    2022-01-16  Milk   10
6    2022-01-02  Bread  20
7    2022-01-03  Bread  22
8    2022-01-05  Bread  24
9    2022-01-08  Bread  20
10   2022-01-12  Bread  10

I want to aggregate (sum, avg, ...) the quantity per item for the last 7 days (or 14, 28 days). The expected outcome would look like this table.

ID   Date        Item   Count  Sum_7d
-------------------------------------
1    2022-01-01  Milk   10     10
2    2022-01-11  Milk   20     20
3    2022-01-12  Milk   10     30
4    2022-01-15  Milk   12     42
5    2022-01-16  Milk   10     52
6    2022-01-02  Bread  20     20
7    2022-01-03  Bread  22     42
8    2022-01-05  Bread  24     66
9    2022-01-08  Bread  20     86
10   2022-01-12  Bread  10     30

My first approach was to use a Redshift window function like this:

SELECT *, SUM(Count) OVER (PARTITION BY Item
                           ORDER BY Date
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM my_table

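but it does not give the expected results because there are missing dates, and I could not figure out how to put a condition on the time range. For example, for Milk on 2022-01-11 the frame also picks up the 2022-01-01 row, giving 30 instead of 20.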

My fallback solution is a cross product, but that's not desirable because it is inefficient for large data.

SELECT l.Date, l.Item, l.Count, sum(r.Count) as Sum_7d
FROM my_table l,
     my_table r
WHERE l.Date - r.Date < 7
  AND l.Date - r.Date >= 0
  AND l.Item = r.Item
GROUP BY 1, 2, 3

Is there any efficient and concise way to do such an aggregation on date ranges in Redshift?

Comments (1)

眼波传意 2025-01-20 02:48:26

This is a missing-data problem, and a common way to "fill in the blanks" is with a cross join. You correctly point out that this can get very expensive, both because cross joining (usually) massively expands the data being worked on and because Redshift isn't great at creating data. But you do have to fill in the missing data. The best way I have found is to create the (near) minimum data set that will complete the data and then UNION it to the original table. The code below takes this approach.

There is a way to do this without adding rows, but the SQL is large, inflexible, error-prone and just plain ugly. You could create new columns (date and count) based on LAG(6), LAG(5), LAG(4) ..., compare the date of each, and use the count only if the date is truly in range. If you want to sum over a different look-back you need to add columns, and things get uglier. Also, this will only be faster than the code below in certain circumstances (very few repeats of item). It just trades making new data in rows for making new data in columns. So don't go this way unless absolutely necessary.
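For illustration only, the LAG-based alternative described above might look roughly like the sketch below for a 7-day look-back. It uses the renamed table and columns from this answer (test, dt, Cnt) and assumes at most one row per item and date and no NULL counts; the column names dt1..dt6 and cnt1..cnt6 are made up for the example.

```sql
-- Sketch of the LAG approach: no rows are added, but one pair of LAG columns
-- is needed for every earlier row that could fall inside the window.
-- Assumes at most one row per (item, dt), so a 7-day window can hold at most
-- 6 earlier rows, and assumes cnt is never NULL.
select id, dt, item, cnt,
       cnt
       + case when dt - dt1 < 7 then cnt1 else 0 end
       + case when dt - dt2 < 7 then cnt2 else 0 end
       + case when dt - dt3 < 7 then cnt3 else 0 end
       + case when dt - dt4 < 7 then cnt4 else 0 end
       + case when dt - dt5 < 7 then cnt5 else 0 end
       + case when dt - dt6 < 7 then cnt6 else 0 end as sum_7d
from (
  select id, dt, item, cnt,
         -- Date and count of the 1st..6th previous row for the same item.
         lag(dt, 1)  over (partition by item order by dt) as dt1,
         lag(cnt, 1) over (partition by item order by dt) as cnt1,
         lag(dt, 2)  over (partition by item order by dt) as dt2,
         lag(cnt, 2) over (partition by item order by dt) as cnt2,
         lag(dt, 3)  over (partition by item order by dt) as dt3,
         lag(cnt, 3) over (partition by item order by dt) as cnt3,
         lag(dt, 4)  over (partition by item order by dt) as dt4,
         lag(cnt, 4) over (partition by item order by dt) as cnt4,
         lag(dt, 5)  over (partition by item order by dt) as dt5,
         lag(cnt, 5) over (partition by item order by dt) as cnt5,
         lag(dt, 6)  over (partition by item order by dt) as dt6,
         lag(cnt, 6) over (partition by item order by dt) as cnt6
  from test
) as lagged
order by id;
```

When there is no earlier row, the LAG columns are NULL, the date comparison is not true, and the CASE falls through to 0. A 28-day look-back would need 27 such pairs, which is why this gets unwieldy quickly.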

Now to what I think will work for you. You need a dummy row for every date and item combination that doesn't already exist. This is the minimal set of new data that will make your window function work. In reality I generate all combinations of date and item and merge these with the existing rows, a slight compromise from the ideal.

First let's set up your data. I changed some names as using reserved words for column names is not ideal.

create table test (ID int, dt date, Item varchar(16), Cnt int);

insert into test values
(1, '2022-01-01', 'Milk', 10),
(2, '2022-01-11', 'Milk', 20),
(3, '2022-01-12', 'Milk', 10),
(4, '2022-01-15', 'Milk', 12),
(5, '2022-01-16', 'Milk', 10),
(6, '2022-01-02', 'Bread', 20),
(7, '2022-01-03', 'Bread', 22),
(8, '2022-01-05', 'Bread', 24),
(9, '2022-01-08', 'Bread', 20),
(10, '2022-01-12', 'Bread', 10);

The SQL for generating what you want is:

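-- Generate every date from the earliest date in the table through today.
with recursive dates(dt) as
( select min(dt) as dt
  from test
  union all
  select dt + 1
  from dates d
  where d.dt < current_date  -- "<" so the last generated date is today
  )
select * 
from (
  -- With one row per item per date, 6 preceding rows plus the current row span 7 days.
  SELECT *, SUM(Cnt) OVER (PARTITION BY Item
                           ORDER BY Dt
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
  FROM (
    -- Merge each real row with the dummy row for the same item and date.
    select min(id) as id, dt, item, sum(cnt) as cnt 
    from (
      select * 
      from test
      union all 
      -- Dummy rows: every item on every generated date.
      select NULL as id, dt, item, NULL as cnt
      from ( select distinct item from test) as items
      cross join dates
      ) as all_item_dates
    group by dt, item
  ) as grouped
) as windowed
where id is not null  -- drop the dummy rows again
order by id, dt;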

Briefly, here is what this does:

  1. A recursive CTE creates the date range in question (from the minimum date in the table until today).
  2. These dates are cross joined with the distinct list of items, resulting in every date for every unique item.
  3. This is combined with the table via UNION ALL so that all data exists.
  4. GROUP BY is used to merge real data rows with dummy rows for the same item and date.
  5. Your window function is run.
  6. A surrounding SELECT has a WHERE clause to remove any dummy rows.
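Since the question also mentions 14- and 28-day look-backs, here is a sketch of how the same gap-filled data could serve several windows at once. It is a variation on the query above under the same assumptions (table test with columns id, dt, item, cnt); the column names sum_14d and sum_28d are only illustrative.

```sql
-- Same gap-filling steps as above, with one window frame per look-back.
with recursive dates(dt) as (
  -- Every date from the earliest date in the table through today.
  select min(dt) as dt from test
  union all
  select dt + 1 from dates d where d.dt < current_date
)
select *
from (
  select id, dt, item, cnt,
         -- One row per item per date, so N-1 preceding rows = N days.
         sum(cnt) over (partition by item order by dt
                        rows between 6 preceding and current row)  as sum_7d,
         sum(cnt) over (partition by item order by dt
                        rows between 13 preceding and current row) as sum_14d,
         sum(cnt) over (partition by item order by dt
                        rows between 27 preceding and current row) as sum_28d
  from (
    -- Merge real rows with the dummy row for the same item and date.
    select min(id) as id, dt, item, sum(cnt) as cnt
    from (
      select id, dt, item, cnt from test
      union all
      -- Dummy rows: every item on every generated date.
      select null as id, dt, item, null as cnt
      from (select distinct item from test) as items
      cross join dates
    ) as all_item_dates
    group by dt, item
  ) as grouped
) as windowed
where id is not null  -- keep only the original rows
order by id;
```

Note that an AVG over the same frames would skip the NULL counts of the dummy rows, so it averages over the days that have data rather than dividing by the window length; divide the sum by 7 (or 14, 28) if a per-day average is what you want.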

As you will note, this does use a cross join, but on a much reduced set of data (just the unique item list). As long as this distinct list of items is much shorter than the table (very likely), this should perform much faster than the full self-join. Also, if this is the kind of data you have, you might be interested in this post I wrote: http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp_2.html
