如何确保在 BigQuery 中仅选择具有最大时间戳值的行?

发布于 2025-01-19 11:19:02 字数 1427 浏览 2 评论 0原文

我的表看起来像这样:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at

实际上,字段数量比这个还要多,大约有 20 个,af 编号只是为了简洁。

该表会定期更新,相同的行可以多次出现,但具有更新的 updated_at 值。

我想要实现的是选择具有最新 updated_at 的行,以避免重复(如果唯一的区别是 updated_at 的值,则行 A 和 是重复的)。

我最初的尝试是这样的:

WITH temp AS (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY datetime, field_a, field_b, ... field_f ORDER BY updated_at DESC) rnk
    FROM some_table)
)

SELECT * FROM temp WHERE rnk = 1

起初,我认为在 PARTITION BY 子句中使用 datetime 可能就足够了,但似乎我必须包含所有字段,以便实现所需的重复数据删除。

这种方法有意义吗?我是否正确,所有字段都应包含在窗口函数中?有没有更优雅的方式来实现我想要的?

输入示例:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at 

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:11:42.864086

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:11:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

查询应返回:

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

即所有字段都相同的行(updated_at 除外),且 updated_at 是最大的。换句话说,(datetime、field_a、field_b、field_c、field_d、field_e、field_f) 的每个唯一组合的最新行。

My table looks something like this:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at

Actually, the number of fields is larger than that, more about 20, the a-f numbering is just for brevity.

This table is updated on a regular basis and the same rows can appear more than once but with more recent values of updated_at.

What I want to achieve is to select rows with the most recent updated_at so as to avoid duplicates (rows A and are duplicates if the only difference is the value of updated_at).

My initial attempt is something like this:

WITH temp AS (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY datetime, field_a, field_b, ... field_f ORDER BY updated_at DESC) rnk
    FROM some_table)
)

SELECT * FROM temp WHERE rnk = 1

At first, I had thought that using datetime in the PARTITION BY clause might be enough, but it seems that I have to include all the fields so that the desired deduplication can happen.

Does this approach make sense? Am I correct in that all fields should be included in the window function? Is there a more elegant way to achieve what I want?

Sample input:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at 

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:11:42.864086

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:11:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

The query should return:

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

That is, rows where all fields are the same (except for updated_at), and updated_at is the largest. In other words, the most recent row for each unique combination of (datetime, field_a, field_b, field_c, field_d, field_e, field_f).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

凡间太子 2025-01-26 11:19:02

请考虑以下方法

select * from your_table t
qualify 1 = row_number() over win
window win as (partition by to_json_string((select as struct * except(updated_at) from unnest([t]))) order by updated_at desc)    

如果应用于问题中的示例数据,

- 输出为在此处输入图像描述

Consider below approach

select * from your_table t
qualify 1 = row_number() over win
window win as (partition by to_json_string((select as struct * except(updated_at) from unnest([t]))) order by updated_at desc)    

if applied to sample data in your question - output is

enter image description here

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文