Grouping and min/max date assignment based on dates that follow a unique combination of values
I am trying to solve another gaps-and-islands problem.
I have the following sample dataset:
MEM_ID | CLM_ID | BGN_SERV_DT | END_SERV_DT | ADM_DT | DCHG_DT | PROV |
---|---|---|---|---|---|---|
1 | 111 | 01-01-2021 | 01-31-2021 | 01-01-2021 | 01-01-2021 | 1 |
1 | 112 | 01-01-2021 | 01-28-2021 | 01-01-2021 | 01-01-2021 | 1 |
1 | 113 | 01-01-2021 | 01-25-2021 | 01-01-2021 | 01-01-2021 | 1 |
1 | 114 | 02-01-2021 | 02-10-2021 | 01-01-2021 | 01-01-2021 | 1 |
1 | 115 | 02-11-2021 | 02-15-2021 | 01-01-2021 | 01-01-2021 | 1 |
1 | 116 | 02-15-2021 | 02-17-2021 | 02-15-2021 | 02-15-2021 | 2 |
1 | 117 | 02-18-2021 | 02-25-2021 | 02-15-2021 | 02-15-2021 | 3 |
1 | 118 | 03-18-2021 | 03-25-2021 | 02-16-2021 | 02-16-2021 | 3 |
2 | 211 | 01-01-2021 | 01-31-2021 | 01-01-2021 | 01-01-2021 | 1 |
2 | 212 | 03-01-2021 | 03-05-2021 | 03-01-2021 | 03-01-2021 | 2 |
3 | 311 | 02-01-2021 | 02-08-2021 | 02-01-2021 | 02-01-2021 | 1 |
3 | 312 | 02-08-2021 | 02-10-2021 | 02-01-2021 | 02-01-2021 | 2 |
What I am trying to accomplish is the following:
For each unique combination of MEM_ID, PROV, and ADM_DT, I want to group the events into one group and assign the min BGN_SERV_DT and max END_SERV_DT. I also need to add any further events from the same MEM_ID and PROV to that group if their ADM_DT is the same as the previous event's DCHG_DT or at most 1 day after it.
The end result should look like this:
MEM_ID | CLM_ID | BGN_SERV_DT | END_SERV_DT | ADM_DT | DCHG_DT | PROV | GROUP | MIN_DT | MAX_DT |
---|---|---|---|---|---|---|---|---|---|
1 | 111 | 01-01-2021 | 01-31-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 02-15-2021 |
1 | 112 | 01-01-2021 | 01-28-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 02-15-2021 |
1 | 113 | 01-01-2021 | 01-25-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 02-15-2021 |
1 | 114 | 02-01-2021 | 02-10-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 02-15-2021 |
1 | 115 | 02-11-2021 | 02-15-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 02-15-2021 |
1 | 116 | 02-15-2021 | 02-17-2021 | 02-15-2021 | 02-15-2021 | 2 | 2 | 02-15-2021 | 02-17-2021 |
1 | 117 | 02-18-2021 | 02-25-2021 | 02-15-2021 | 02-15-2021 | 3 | 3 | 02-18-2021 | 03-25-2021 |
1 | 118 | 03-18-2021 | 03-25-2021 | 02-16-2021 | 02-16-2021 | 3 | 3 | 02-18-2021 | 03-25-2021 |
2 | 211 | 01-01-2021 | 01-31-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 | 01-01-2021 | 01-31-2021 |
2 | 212 | 03-01-2021 | 03-05-2021 | 03-01-2021 | 03-01-2021 | 2 | 2 | 03-01-2021 | 03-05-2021 |
3 | 311 | 02-01-2021 | 02-08-2021 | 02-01-2021 | 02-01-2021 | 1 | 1 | 02-01-2021 | 02-08-2021 |
3 | 312 | 02-08-2021 | 02-10-2021 | 02-01-2021 | 02-01-2021 | 2 | 2 | 02-08-2021 | 02-10-2021 |
You will notice that some of the ADM_DT values are earlier than the BGN_SERV_DT. That is not a data error.
In the expected output, the first 5 records for MEM_ID 1 are grouped into group 1 because they share the same MEM_ID, PROV, and ADM_DT combination. Records 6 and 7 have the same MEM_ID and ADM_DT but different PROV values, so they are split into groups 2 and 3. Record 8 has the same MEM_ID and PROV as record 7 but a different ADM_DT. However, because the ADM_DT of record 8 is within 1 day of the DCHG_DT of record 7, it is considered the same event and assigned to group 3. The remaining rows are grouped by the same rules.
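The grouping rule described above can be sketched as follows. This is only an illustration under several assumptions, not a definitive solution: it runs the SQL in SQLite through Python's `sqlite3` module (the question does not name its database engine), it stores the dates as ISO `YYYY-MM-DD` text so they compare correctly, and the names `PREV_ADM`, `PREV_DCHG`, `ISLAND`, `ISLAND_START`, `GRP`, and `assign_groups` are made up for the example. The idea is to look at the previous row within the same MEM_ID and PROV, flag a row as a new island unless it continues the previous one, and then number the islands per member.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO text (an assumption).
# Columns: MEM_ID, CLM_ID, BGN_SERV_DT, END_SERV_DT, ADM_DT, DCHG_DT, PROV
ROWS = [
    (1, 111, "2021-01-01", "2021-01-31", "2021-01-01", "2021-01-01", 1),
    (1, 112, "2021-01-01", "2021-01-28", "2021-01-01", "2021-01-01", 1),
    (1, 113, "2021-01-01", "2021-01-25", "2021-01-01", "2021-01-01", 1),
    (1, 114, "2021-02-01", "2021-02-10", "2021-01-01", "2021-01-01", 1),
    (1, 115, "2021-02-11", "2021-02-15", "2021-01-01", "2021-01-01", 1),
    (1, 116, "2021-02-15", "2021-02-17", "2021-02-15", "2021-02-15", 2),
    (1, 117, "2021-02-18", "2021-02-25", "2021-02-15", "2021-02-15", 3),
    (1, 118, "2021-03-18", "2021-03-25", "2021-02-16", "2021-02-16", 3),
    (2, 211, "2021-01-01", "2021-01-31", "2021-01-01", "2021-01-01", 1),
    (2, 212, "2021-03-01", "2021-03-05", "2021-03-01", "2021-03-01", 2),
    (3, 311, "2021-02-01", "2021-02-08", "2021-02-01", "2021-02-01", 1),
    (3, 312, "2021-02-08", "2021-02-10", "2021-02-01", "2021-02-01", 2),
]

QUERY = """
WITH prev AS (
    -- Previous row's dates within the same member and provider.
    SELECT c1.*,
           LAG(ADM_DT)  OVER (PARTITION BY MEM_ID, PROV
                              ORDER BY ADM_DT, DCHG_DT, CLM_ID) AS PREV_ADM,
           LAG(DCHG_DT) OVER (PARTITION BY MEM_ID, PROV
                              ORDER BY ADM_DT, DCHG_DT, CLM_ID) AS PREV_DCHG
    FROM c1
),
flagged AS (
    -- A row continues the island if it has the same ADM_DT as the previous row
    -- or is admitted 0 to 1 days after the previous discharge; otherwise it starts one.
    SELECT prev.*,
           CASE
               WHEN ADM_DT = PREV_ADM THEN 0
               WHEN (julianday(ADM_DT) - julianday(PREV_DCHG)) BETWEEN 0 AND 1 THEN 0
               ELSE 1
           END AS ISSTART
    FROM prev
),
islands AS (
    -- Running sum of the start flags numbers the islands per member and provider.
    SELECT flagged.*,
           SUM(ISSTART) OVER (PARTITION BY MEM_ID, PROV
                              ORDER BY ADM_DT, DCHG_DT, CLM_ID
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ISLAND
    FROM flagged
),
starts AS (
    SELECT islands.*,
           MIN(ADM_DT) OVER (PARTITION BY MEM_ID, PROV, ISLAND) AS ISLAND_START
    FROM islands
)
-- Number the islands 1, 2, 3, ... within each member.
SELECT MEM_ID, CLM_ID,
       DENSE_RANK() OVER (PARTITION BY MEM_ID
                          ORDER BY ISLAND_START, PROV, ISLAND) AS GRP
FROM starts
ORDER BY MEM_ID, CLM_ID
"""


def assign_groups(rows):
    """Return {CLM_ID: group number}; needs SQLite 3.25+ for window functions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE c1 (MEM_ID INTEGER, CLM_ID INTEGER, BGN_SERV_DT TEXT, "
        "END_SERV_DT TEXT, ADM_DT TEXT, DCHG_DT TEXT, PROV INTEGER)"
    )
    conn.executemany("INSERT INTO c1 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    groups = {clm_id: grp for _mem_id, clm_id, grp in conn.execute(QUERY)}
    conn.close()
    return groups


if __name__ == "__main__":
    for clm_id, grp in assign_groups(ROWS).items():
        print(clm_id, grp)
```

In this sketch the `LAG` and the running `SUM` both partition by MEM_ID and PROV, so a change of provider always starts a new island. On engines with native date types, the `julianday` difference would be replaced by that engine's own date arithmetic.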
This is a follow-up to a previous question that I asked via the link below:
Grouping based on start date matching the previous row's end date SQL
I have tried the following code:
SELECT DISTINCT MEM_ID
    , PROV
    , CLM_ID
    , ADM_DT
    , DCHG_DT
    , sum(ISSTART) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT rows unbounded preceding) AS GROUP
FROM (
    SELECT DISTINCT MEM_ID
        , PROV
        , CLM_ID
        , ADM_DT
        , DCHG_DT
        , CASE
            WHEN PROV = LAG(PROV) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                AND (
                    (
                        ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                    )
                    OR (
                        ADM_DT = LAG(DCHG_DT + 1) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                    )
                )
                THEN 0
            ELSE 1
            END AS ISSTART
    FROM c1
    ) t
The above code captures most of the instances, but I am unable to figure out how to add the min and max dates. I am also unsure whether my CASE statement is being used correctly.
Any help would be appreciated.
Thank you in advance!
EDIT:
After further investigation, it appears that either my subquery that creates the islands is faulty or my sum statement isn't being calculated correctly.
I tried the following change to the sum statement to see if it fixes the issue:
SELECT DISTINCT MEM_ID
    , PROV
    , CLM_ID
    , ADM_DT
    , DCHG_DT
    , sum(ISSTART) OVER (PARTITION BY MEM_ID,ADM_DT ORDER BY ADM_DT, DCHG_DT rows unbounded preceding) AS GROUP
FROM (
    SELECT DISTINCT MEM_ID
        , PROV
        , CLM_ID
        , ADM_DT
        , DCHG_DT
        , CASE
            WHEN PROV = LAG(PROV) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                AND (
                    (
                        ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                    )
                    OR (
                        ADM_DT = LAG(DCHG_DT + 1) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                    )
                )
                THEN 0
            ELSE 1
            END AS ISSTART
    FROM c1
    ) t
However, what I end up with is the following:
MEM_ID | CLM_ID | BGN_SERV_DT | END_SERV_DT | ADM_DT | DCHG_DT | PROV | GROUP |
---|---|---|---|---|---|---|---|
1 | 111 | 01-01-2021 | 01-31-2021 | 01-01-2021 | 01-01-2021 | 1 | 0 |
1 | 112 | 01-01-2021 | 01-28-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 |
1 | 113 | 01-01-2021 | 01-25-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 |
1 | 114 | 02-01-2021 | 02-10-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 |
1 | 115 | 02-11-2021 | 02-15-2021 | 01-01-2021 | 01-01-2021 | 1 | 1 |
1 | 116 | 02-15-2021 | 02-17-2021 | 02-15-2021 | 02-15-2021 | 2 | 0 |
1 | 117 | 02-18-2021 | 02-25-2021 | 02-15-2021 | 02-15-2021 | 3 | 0 |
1 | 118 | 03-18-2021 | 03-25-2021 | 02-16-2021 | 02-16-2021 | 3 | 1 |
Comments (1)
If everything else works fine and you just need the two columns MIN_DT and MAX_DT, then you can use the max and min window functions:
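A minimal sketch of that idea, assuming the group number has already been computed and stored alongside each claim. The table name `grouped`, the column name `GRP` (used instead of `GROUP`, which is a reserved word in SQL), the helper `add_min_max`, the use of SQLite through Python's `sqlite3` module, and the ISO date strings are all assumptions made for the example; only the rows for MEM_ID 1 from the question are included.

```python
import sqlite3

# Claims for MEM_ID 1 with their group number already assigned (hypothetical table).
# Columns: MEM_ID, CLM_ID, BGN_SERV_DT, END_SERV_DT, GRP
GROUPED_ROWS = [
    (1, 111, "2021-01-01", "2021-01-31", 1),
    (1, 112, "2021-01-01", "2021-01-28", 1),
    (1, 113, "2021-01-01", "2021-01-25", 1),
    (1, 114, "2021-02-01", "2021-02-10", 1),
    (1, 115, "2021-02-11", "2021-02-15", 1),
    (1, 116, "2021-02-15", "2021-02-17", 2),
    (1, 117, "2021-02-18", "2021-02-25", 3),
    (1, 118, "2021-03-18", "2021-03-25", 3),
]

# MIN/MAX as window functions keep every row and attach the group-wide
# earliest start and latest end to each of them.
MIN_MAX_QUERY = """
SELECT CLM_ID,
       MIN(BGN_SERV_DT) OVER (PARTITION BY MEM_ID, GRP) AS MIN_DT,
       MAX(END_SERV_DT) OVER (PARTITION BY MEM_ID, GRP) AS MAX_DT
FROM grouped
ORDER BY MEM_ID, CLM_ID
"""


def add_min_max(rows):
    """Return {CLM_ID: (MIN_DT, MAX_DT)}; needs SQLite 3.25+ for window functions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE grouped (MEM_ID INTEGER, CLM_ID INTEGER, "
        "BGN_SERV_DT TEXT, END_SERV_DT TEXT, GRP INTEGER)"
    )
    conn.executemany("INSERT INTO grouped VALUES (?, ?, ?, ?, ?)", rows)
    result = {
        clm_id: (min_dt, max_dt)
        for clm_id, min_dt, max_dt in conn.execute(MIN_MAX_QUERY)
    }
    conn.close()
    return result


if __name__ == "__main__":
    for clm_id, (min_dt, max_dt) in add_min_max(GROUPED_ROWS).items():
        print(clm_id, min_dt, max_dt)
```

Because the window has no ORDER BY, each aggregate is taken over the whole partition, so every claim in a group gets the same MIN_DT and MAX_DT.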