Redshift: aggregating grouped data over a date range
I have the following table, which contains an item count per day.
ID   Date         Item    Count
-------------------------------
1    2022-01-01   Milk    10
2    2022-01-11   Milk    20
3    2022-01-12   Milk    10
4    2022-01-15   Milk    12
5    2022-01-16   Milk    10
6    2022-01-02   Bread   20
7    2022-01-03   Bread   22
8    2022-01-05   Bread   24
9    2022-01-08   Bread   20
10   2022-01-12   Bread   10
I want to aggregate (sum, average, ...) the count per item over the past 7 days (or 14, 28 days). The expected result looks like the table below.
ID   Date         Item    Count   Sum_7d
----------------------------------------
1    2022-01-01   Milk    10      10
2    2022-01-11   Milk    20      20
3    2022-01-12   Milk    10      30
4    2022-01-15   Milk    12      42
5    2022-01-16   Milk    10      52
6    2022-01-02   Bread   20      20
7    2022-01-03   Bread   22      42
8    2022-01-05   Bread   24      66
9    2022-01-08   Bread   20      86
10   2022-01-12   Bread   10      30
My first approach was to use a Redshift window function like this:
SELECT *,
       SUM(Count) OVER (PARTITION BY Item
                        ORDER BY Date
                        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM my_table
But it does not give the expected result, because dates are missing, and I don't know how to set the condition on a time range instead of a number of rows.
My fallback solution is a cross product, but that is not desirable because it is inefficient for large data:
SELECT l.Date, l.Item, l.Count, SUM(r.Count) AS Sum_7d
FROM my_table l,
     my_table r
WHERE l.Date - r.Date < 7
  AND l.Date - r.Date >= 0
  AND l.Item = r.Item
GROUP BY 1, 2, 3
Is there an efficient and concise way in Redshift to do this kind of aggregation over a date range?
1 Answer
This is a missing-data problem, and a common way to "fill in the blanks" is with a cross join. You correctly point out that this can get very expensive, because the cross join (usually) massively expands the data being worked upon, and because Redshift isn't great at creating data. But you do have to fill in the missing data. The best way I have found is to create the (near) minimum data set that completes the data, then UNION this data to the original table. The code below follows this path.
There is a way to do this without adding rows, but the SQL is large, inflexible, error-prone, and just plain ugly. You could create new columns (date and count) based on LAG(6), LAG(5), LAG(4), and so on, compare the date in each, and use the count only when that date is truly in range. If you want to sum over a different look-back you need to add columns, and things get uglier. Also, this is only faster than the code below in certain circumstances (very few repeats per item). It just trades making new data in rows for making new data in columns. So don't go this way unless absolutely necessary.
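For illustration only, here is a minimal sketch of that column-based pattern, assuming the hypothetical names test_data, sale_date, and sale_count from the setup further down, and writing out only the first two look-back columns:

-- Not recommended: one CASE per look-back row, each guarded by the 7-day range check
SELECT id, sale_date, item, sale_count,
       sale_count
       + CASE WHEN LAG(sale_date, 1) OVER (PARTITION BY item ORDER BY sale_date)
                   >= sale_date - 6
              THEN LAG(sale_count, 1) OVER (PARTITION BY item ORDER BY sale_date)
              ELSE 0 END
       + CASE WHEN LAG(sale_date, 2) OVER (PARTITION BY item ORDER BY sale_date)
                   >= sale_date - 6
              THEN LAG(sale_count, 2) OVER (PARTITION BY item ORDER BY sale_date)
              ELSE 0 END
       -- ... repeated up to LAG(..., 6) for the full 7-day window
       AS sum_7d
FROM test_data;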
Now to what I think will work for you. You need a dummy row for every date and item combination that doesn't already exist. This is the minimal set of new data that will make your window function work. In reality I make all the combinations of dates and items and merge these with the existing rows - a slight compromise from the ideal.
First let's set up your data. I changed some names, since using reserved words as column names is not ideal.
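A sketch of such a setup, under assumed names: the table is called test_data here, and the reserved words Date and Count become sale_date and sale_count:

CREATE TABLE test_data (
    id         INT,
    sale_date  DATE,        -- renamed from "Date", a reserved word
    item       VARCHAR(16),
    sale_count INT          -- renamed from "Count", also reserved
);

INSERT INTO test_data VALUES
    (1,  '2022-01-01', 'Milk',  10),
    (2,  '2022-01-11', 'Milk',  20),
    (3,  '2022-01-12', 'Milk',  10),
    (4,  '2022-01-15', 'Milk',  12),
    (5,  '2022-01-16', 'Milk',  10),
    (6,  '2022-01-02', 'Bread', 20),
    (7,  '2022-01-03', 'Bread', 22),
    (8,  '2022-01-05', 'Bread', 24),
    (9,  '2022-01-08', 'Bread', 20),
    (10, '2022-01-12', 'Bread', 10);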
The SQL for generating what you want is:
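What follows is a sketch of that query under the same assumed names, reconstructing the path described above (build every date/item combination, UNION with the real rows, merge, apply the window, then drop the dummies); the date bounds are hard-coded to the sample data:

WITH RECURSIVE dates (sale_date) AS (
    -- every calendar date in the range; bounds hard-coded for the sample,
    -- in practice they would be derived from MIN/MAX of the table
    SELECT '2022-01-01'::date
    UNION ALL
    SELECT (sale_date + 1)::date
    FROM dates
    WHERE sale_date < '2022-01-16'::date
),
filled AS (
    -- a zero-count dummy row for every item/date combination,
    -- UNIONed with the real rows
    SELECT NULL::int AS id, d.sale_date, i.item, 0 AS sale_count
    FROM dates d
    CROSS JOIN (SELECT DISTINCT item FROM test_data) i
    UNION ALL
    SELECT id, sale_date, item, sale_count
    FROM test_data
),
merged AS (
    -- collapse dummy and real rows so each item has exactly one row per
    -- date; id stays NULL only where no real row exists
    SELECT item, sale_date, MAX(id) AS id, SUM(sale_count) AS sale_count
    FROM filled
    GROUP BY item, sale_date
)
SELECT id, sale_date, item, sale_count, sum_7d
FROM (
    SELECT id, sale_date, item, sale_count,
           SUM(sale_count) OVER (
               PARTITION BY item
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS sum_7d
    FROM merged
) w
WHERE id IS NOT NULL      -- keep only the rows that existed originally
ORDER BY item, sale_date;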
Quickly, here is what the sketch above does:
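- The recursive CTE dates produces one row per calendar date between the first and last date in the data.
- filled cross-joins that date list with the distinct item list to create a zero-count dummy row for every combination, then UNIONs the real rows back in.
- merged collapses dummy and real rows so each item has exactly one row per date, which makes the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW frame equivalent to a 7-day range.
- The outer query computes the windowed sum, then discards the dummy rows (id IS NULL), leaving the original rows with their Sum_7d.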
As you will note, this does use a cross join, but on a much reduced set of data (just the unique item list). As long as this distinct list of items is much shorter than the table itself (very likely), this will perform much faster than the other techniques. Also, if this is the kind of data you work with, you might find this post I wrote interesting - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp_2.html