Row_number对于重复记录不是唯一的
我正在尝试从每个UUID的一堆行中找到特定行的最新更新。 为此,我们在分区上使用row_number()
,如下所示,
"row_number"() OVER (
PARTITION BY "uuid"
ORDER BY "uuid" ASC,
"status"."sequence" DESC,
"modifiedon" DESC
) "row_id_ranked",
此后,除了'row_id_ranked = 1'以外的任何其他内容都被丢弃了。但是,当发生重试/重新传输时,“ status”。“序列”
和“ modifiedon”
将完全相同。这将使用row_id_ranked = 1
创建utiple行。很难删除它们。 根据我的理解,文档>应该是一个独特的价值。但是对于这些重复的情况,显然不是。
row_number()→bigint#
返回每行的唯一顺序数字,从一个开始, 根据窗口分区中的行的顺序。
在这种情况下,我该如何脱颖而出?
I am trying to find the latest update of a particular row from a bunch of rows per uuid.
For that we use row_number()
over a partition as shown below,
"row_number"() OVER (
PARTITION BY "uuid"
ORDER BY "uuid" ASC,
"status"."sequence" DESC,
"modifiedon" DESC
) "row_id_ranked",
After this, anything other than 'row_id_ranked = 1' are discarded. But when a retry/re-transmission happens, "status"."sequence"
and "modifiedon"
will be exactly same. This creates mutiple rows with row_id_ranked=1
. Making it difficult to remove them.
As per my understanding per the documentation, the row_number
should be a unique value. But in case of these duplicates, it clearly is not.
row_number() → bigint#
Returns a unique, sequential number for each row, starting with one,
according to the ordering of rows within the window partition.
How do I de-duplicate in this case?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
假设除了状态序列之外不存在第三列,并且在日期上进行修改以打破潜在的平局,并且还假设您不关心保留哪一条记录,则可以使用 RANDOM() 这里:
Assuming that there does not exist some third column beyond the status sequence and modified on date to break a potential tie, and also assuming you don't care which single record gets retained, you may use
RANDOM()
here:如果您在行号的“over”子句中使用“按 uuid 分区”,您将为每个不同的 uuid 获得第 1 行。也就是说,您已按 uuid 对数据进行分区,并且每个分区的编号将重新开始,如您发布的定义中所示。
我认为您只想使用 order by,它将返回 1 行号 1。
If you use "Partition by uuid" in the "over" clause of row number you will get a row 1 for each distinct uuid. That is you have partitioned your data by uuid and the numbering will restart for each partition, as in the definition you have posted.
I think that you just want to use order by, which will return 1 row number 1.