Another date gap-filling SQL puzzle

Posted on 2024-12-23 05:17:48

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.

Here's what I've got:

t:
    day    | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 | 10     | 10
2011-12-03 |  1 | 12     | 2
2011-12-04 |  1 | 15     | 3

Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:

t_fill:
    day    | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 | 10     | 10
2011-12-02 |  1 | 10     | 0 -- a delta of 0
2011-12-03 |  1 | 12     | 2
2011-12-04 |  1 | 15     | 3

I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.

I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.

Edit: assume I have a table like this:

calendar:
    day 
------------
2011-01-01
2011-01-02
   ...

This table can be used in joins. My intent is to maintain the date range in calendar so that it matches the date range in t.

edit:
A few more notes on what I'm looking for, just to be specific:

In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:

1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.

2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.

A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that first appears after the first date in t, the rows between the first date in t and the first date for that id would be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only the rows where metric and d_metric are not both 0 (that is, metric!=0 OR d_metric!=0, so that the gap-filled rows with a 0 delta are kept).

Comments (4)

平安喜乐 2024-12-30 05:17:48

This is about what Jonathan Leffler proposed, but written in old-fashioned low-level SQL (without fancy CTEs, window functions, or aggregating subqueries):

SET search_path='tmp';
DROP TABLE IF EXISTS ttable CASCADE;
CREATE TABLE ttable
        ( zday date NOT NULL
        , id INTEGER NOT NULL
        , metric INTEGER NOT NULL
        , d_metric INTEGER NOT NULL
        , PRIMARY KEY (id,zday)
        );
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
 ('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
        ;

DROP TABLE IF EXISTS ctable CASCADE;
CREATE TABLE ctable
        ( zday date NOT NULL
        , PRIMARY KEY (zday)
        );
INSERT INTO ctable(zday) VALUES
 ('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
        ;

CREATE VIEW v_cte AS (
        SELECT t.zday,t.id,t.metric,t.d_metric
        FROM ttable t
        JOIN ctable c ON c.zday = t.zday
        UNION
        SELECT c.zday,t.id,t.metric, 0
        FROM ctable c, ttable t
        WHERE t.zday < c.zday
        AND NOT EXISTS ( SELECT *
                FROM ttable nx
                WHERE nx.id = t.id
                AND nx.zday = c.zday
                )
        AND NOT EXISTS ( SELECT *
                FROM ttable nx
                WHERE nx.id = t.id
                AND nx.zday < c.zday
                AND nx.zday > t.zday
                )
        )
        ;
SELECT * FROM v_cte;

The results:

    zday    | id | metric | d_metric 
------------+----+--------+----------
 2011-12-01 |  1 |     10 |       10
 2011-12-02 |  1 |     10 |        0
 2011-12-03 |  1 |     12 |        2
 2011-12-04 |  1 |     15 |        3
(4 rows)
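
To check the two properties from the question (fill forward once an id appears, and keep filling to the end of the range after its last data point), the same query can be tried on a small data set with a second id. The following is only a sketch: it runs the view's SELECT on SQLite through Python's sqlite3 module rather than on Vertica, stores dates as ISO text, and adds a hypothetical id 2 that is not in the original data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ttable (zday TEXT NOT NULL, id INTEGER NOT NULL,
                     metric INTEGER NOT NULL, d_metric INTEGER NOT NULL,
                     PRIMARY KEY (id, zday));
INSERT INTO ttable (zday, id, metric, d_metric) VALUES
  ('2011-12-01', 1, 10, 10),
  ('2011-12-03', 1, 12, 2),
  ('2011-12-04', 1, 15, 3),
  ('2011-12-02', 2, 7, 7);   -- hypothetical second id with a single data point
CREATE TABLE ctable (zday TEXT NOT NULL PRIMARY KEY);
INSERT INTO ctable (zday) VALUES
  ('2011-12-01'), ('2011-12-02'), ('2011-12-03'), ('2011-12-04');
""")

# Same logic as the view: real rows, plus one filler row per calendar day
# that carries the most recent earlier metric for the id with a delta of 0.
FILL_QUERY = """
SELECT t.zday, t.id, t.metric, t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday, t.id, t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
  AND NOT EXISTS (SELECT * FROM ttable nx          -- no real row on that day
                  WHERE nx.id = t.id AND nx.zday = c.zday)
  AND NOT EXISTS (SELECT * FROM ttable nx          -- t is the latest earlier row
                  WHERE nx.id = t.id AND nx.zday < c.zday AND nx.zday > t.zday)
ORDER BY 2, 1   -- by id, then by day
"""

rows = conn.execute(FILL_QUERY).fetchall()
for row in rows:
    print(row)
```

In this sketch id 2 gets no rows before its first data point, and its single value is carried forward with a 0 delta through the last calendar day.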
月寒剑心 2024-12-30 05:17:48

I am not a Vertica user, but if you do not want to use its native support for gap filling, here you can find a more generic SQL-only solution to do so.

知你几分 2024-12-30 05:17:48

If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.

Depending on your needs you can make the temporary table transaction or session-scoped.

I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
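
As a rough illustration of the temporary-table idea, the intermediate result that a CTE would hold can be materialized first and then queried. This is a sketch on SQLite through Python's sqlite3 module, not Vertica; the table name grid is hypothetical, and the scoping options for temporary tables in Vertica use different syntax than shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day TEXT, id INTEGER, metric INTEGER, d_metric INTEGER);
INSERT INTO t VALUES
  ('2011-12-01', 1, 10, 10),
  ('2011-12-03', 1, 12, 2),
  ('2011-12-04', 1, 15, 3);
CREATE TABLE calendar (day TEXT PRIMARY KEY);
INSERT INTO calendar VALUES
  ('2011-12-01'), ('2011-12-02'), ('2011-12-03'), ('2011-12-04');

-- Stands in for a CTE: every (day, id) pair that t_fill should contain.
-- In SQLite a TEMP table lives until the connection is closed.
CREATE TEMP TABLE grid AS
SELECT c.day, i.id
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM t) i;
""")

# Later queries can reuse grid, for example to list the gaps to be filled.
missing = conn.execute("""
SELECT g.day, g.id
FROM grid g
LEFT JOIN t ON t.day = g.day AND t.id = g.id
WHERE t.id IS NULL
ORDER BY g.id, g.day
""").fetchall()
print(missing)
```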

硬不硬你别怂 2024-12-30 05:17:48

Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.

Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:

  • For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
  • For each distinct ID represented in T,
  • Find the metric for the given ID for the most recent record in T on or before the date.

This gives you a complete list of dates with metrics.

You then need to self-join two copies of that list with dates one day apart to form the deltas.

Note that if some ID values don't appear at the start of the date range, they won't show up.

With that as guidance, you should be able to get going, I believe.
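
The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: it runs on SQLite through Python's sqlite3 module rather than on Vertica, stores dates as ISO text, uses a hypothetical view name filled, adds a hypothetical id 2 to show an id that enters late, and treats a missing previous day as a metric of 0 so that the first delta equals the metric, as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day TEXT, id INTEGER, metric INTEGER);
INSERT INTO t VALUES
  ('2011-12-01', 1, 10),
  ('2011-12-03', 1, 12),
  ('2011-12-04', 1, 15),
  ('2011-12-03', 2, 5);    -- hypothetical id that enters late
CREATE TABLE calendar (day TEXT PRIMARY KEY);
INSERT INTO calendar VALUES
  ('2011-11-30'), ('2011-12-01'), ('2011-12-02'),
  ('2011-12-03'), ('2011-12-04'), ('2011-12-05');

-- For each calendar day inside t's date range and each distinct id,
-- take the metric of the most recent record on or before that day.
-- Days before an id's first record are left out.
CREATE VIEW filled AS
SELECT c.day, i.id,
       (SELECT t2.metric FROM t t2
        WHERE t2.id = i.id AND t2.day <= c.day
        ORDER BY t2.day DESC LIMIT 1) AS metric
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM t) i
WHERE c.day BETWEEN (SELECT MIN(day) FROM t) AND (SELECT MAX(day) FROM t)
  AND EXISTS (SELECT 1 FROM t t3 WHERE t3.id = i.id AND t3.day <= c.day);
""")

# Self-join two copies of the filled list one day apart to form the deltas.
result = conn.execute("""
SELECT cur.day, cur.id, cur.metric,
       cur.metric - COALESCE(prev.metric, 0) AS d_metric
FROM filled cur
LEFT JOIN filled prev
       ON prev.id = cur.id
      AND prev.day = date(cur.day, '-1 day')
ORDER BY cur.id, cur.day
""").fetchall()
for row in result:
    print(row)
```

The date(cur.day, '-1 day') call is SQLite's date arithmetic; on another database the one-day offset would be written with that system's own date functions.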
