bigquerqy sql 在组之间链接公共 grid_id 第二部分

发布于 2025-01-17 00:08:20 字数 8 浏览 4 评论 0原文

continue

The following result was obtained from Part 1.
bigquerqy sql link a common grid_id between groups

with t1 as
(
Select 'obrien-t j' lname_forename_long,11 grid_id_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq UNION ALL
Select 'obrien-t j',1,'grid.1002.3',1,1 UNION ALL
Select 'obrien-terence',2,'grid.1008.9',1,2 UNION ALL
Select 'obrien-terence',4,'grid.416153.4',2,2 UNION ALL
Select 'obrien-terence',1,'grid.484852.7',3,2 UNION ALL
Select 'obrien-terence j',14,'grid.1002.3',1,3 UNION ALL
Select 'obrien-terence j',25,'grid.1008.9',2,3 UNION ALL
Select 'obrien-terence j',3,'grid.1019.9',3,3 UNION ALL
Select 'obrien-terence j',9,'grid.1623.6',4,3 UNION ALL
Select 'obrien-terence j',40,'grid.237081.f',5,3 UNION ALL
Select 'obrien-terence j',1,'grid.267362.4',6,3 UNION ALL
Select 'obrien-terence j',2,'grid.414094.c',7,3 UNION ALL
Select 'obrien-terence j',1,'grid.416060.5',8,3 UNION ALL
Select 'obrien-terence j',36,'grid.416153.4',9,3 UNION ALL
Select 'obrien-terence j',4,'grid.453219.8',10,3 UNION ALL
Select 'obrien-terence j',3,'grid.454055.5',11,3 UNION ALL
Select 'obrien-terence j',6,'grid.474069.8',12,3 UNION ALL
Select 'obrien-terence j',13,'grid.481253.9',13,3 UNION ALL
Select 'obrien-terence john',1,'grid.1002.3',1,4 UNION ALL
Select 'obrien-terence john',1,'grid.1008.9',2,4 UNION ALL
Select 'obrien-terence john',1,'grid.1623.6',3,4 UNION ALL
Select 'obrien-terence john',1,'grid.237081.f',4,4 UNION ALL
Select 'obrien-terence john',2,'grid.416153.4',5,4 UNION ALL
Select 'obrien-terrence',2,'grid.416153.4',1,5 UNION ALL
Select 'obrien-terrence j',1,'grid.416153.4',1,6 UNION ALL
Select 'obrien-terry',1,'grid.137628.9',1,7 UNION ALL
Select 'obrien-terry',2,'grid.237081.f',2,7 UNION ALL
Select 'obrien-terry',1,'grid.267362.4',3,7 UNION ALL
Select 'obrien-timothy',1,'grid.496867.2',1,8 UNION ALL
Select 'obrien-timothy',3,'grid.6142.1',2,8 
) 
select *, if(count(*) over win > 0, string_agg('' || group_seq) over win, '') links
from t1
window win as (partition by grid_id)    ;

The above does not include a counts column which I think maybe needed.

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_counts
obrien-t j11grid.416153.4211,2,3,4,5,66
obrien-t j1grid.1002.3111,3,43
obrien-terence4grid.416153.4221,2,3,4,5,66
obrien-terence2grid.1008.9122,3,43
obrien-terence1grid.484852.73221
obrien-terence j36grid.416153.4931,2,3,4,5,66
obrien-terence j14grid.1002.3131,3,43
obrien-terence j25grid.1008.9232,3,43
obrien-terence j40grid.237081.f533,4,73
obrien-terence j9grid.1623.6433,42
obrien-terence j1grid.267362.4633,72
obrien-terence j3grid.1019.93331
obrien-terence j2grid.414094.c7331
obrien-terence j1grid.416060.58331
obrien-terence j4grid.453219.810331
obrien-terence j3grid.454055.511331
obrien-terence j6grid.474069.812331
obrien-terence j13grid.481253.913331
obrien-terence john2grid.416153.4541,2,3,4,5,66
obrien-terence john1grid.1002.3141,3,43
obrien-terence john1grid.1008.9242,3,43
obrien-terence john1grid.237081.f443,4,73
obrien-terence john1grid.1623.6343,42
obrien-terrence2grid.416153.4151,2,3,4,5,66
obrien-terrence j1grid.416153.4161,2,3,4,5,66
obrien-terry2grid.237081.f273,4,73
obrien-terry1grid.267362.4373,72
obrien-terry1grid.137628.91771
obrien-timothy3grid.6142.12881
obrien-timothy1grid.496867.21881

The second part is take all the names with max(link_counts)

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_counts
obrien-t j11grid.416153.4211,2,3,4,5,66
obrien-terence4grid.416153.4221,2,3,4,5,66
obrien-terence j36grid.416153.4931,2,3,4,5,66
obrien-terence john2grid.416153.4541,2,3,4,5,66
obrien-terrence2grid.416153.4151,2,3,4,5,66
obrien-terrence j1grid.416153.4161,2,3,4,5,66

Add the names that are not in max(link_counts) = 6
choose the nmaes with the highest grid_id_ct to give.

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_counts
obrien-timothy3grid.6142.12881
obrien-terry2grid.237081.f273,4,73
obrien-terrence j1grid.416153.4161,2,3,4,5,66
obrien-terrence2grid.416153.4151,2,3,4,5,66
obrien-terence john2grid.416153.4541,2,3,4,5,66
obrien-terence j36grid.416153.4931,2,3,4,5,66
obrien-terence4grid.416153.4221,2,3,4,5,66
obrien-t j11grid.416153.4211,2,3,4,5,66

If any of the new names can link to the link_counts = 6 update a column the links that can intersect.

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_countsis_intersect_links
obrien-timothy3grid.6142.12881
obrien-terry2grid.237081.f273,4,733,4
obrien-terrence j1grid.416153.4161,2,3,4,5,663,4
obrien-terrence2grid.416153.4151,2,3,4,5,663,4
obrien-terence john2grid.416153.4541,2,3,4,5,663,4
obrien-terence j36grid.416153.4931,2,3,4,5,663,4
obrien-terence4grid.416153.4221,2,3,4,5,663,4
obrien-t j11grid.416153.4211,2,3,4,5,663,4

Since we can now link obrien-terry to the other obrien-t..... names update his grid_id to be the same as obrien-t..... grid.416153.4

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_countsis_intersect_linksis_merged
obrien-timothy3grid.6142.12881''FALSE
obrien-terry2grid.416153.4273,4,733,4TRUE
obrien-terrence j1grid.416153.4161,2,3,4,5,663,4FALSE
obrien-terrence2grid.416153.4151,2,3,4,5,663,4FALSE
obrien-terence john2grid.416153.4541,2,3,4,5,663,4FALSE
obrien-terence j36grid.416153.4931,2,3,4,5,663,4FALSE
obrien-terence4grid.416153.4221,2,3,4,5,663,4FALSE
obrien-t j11grid.416153.4211,2,3,4,5,663,4FALSE

I also added is_merged to indicate that a grid_id was updated.
I have added multiple steps to make it clear but its possible its one or two steps.
I have tried multiple ways of doing this using cartesain joins, intersect distinct to find a common grid between names but they all came up short.
In simple terms I am trying to find figure out how many unique obriens I have based on being able to assign them to a common grid_id which is basically an address.

I'm not sure if I have over complicated it with all the intermediate steps. I don't need all the meta data columns I just need to end up with .

lname_forename_longgrid_idis_merged
obrien-timothygrid.6142.1FALSE
obrien-terrygrid.416153.4TRUE
obrien-terrence jgrid.416153.4FALSE
obrien-terrencegrid.416153.4FALSE
obrien-terence johngrid.416153.4FALSE
obrien-terence jgrid.416153.4FALSE
obrien-terencegrid.416153.4FALSE
obrien-t jgrid.416153.4FALSE

MY effort for samuel.

with t2 as (
with t1 as
(
Select "o'brien-t j" lname,11 grid_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq ,'1,2,3,4,5,6' links UNION ALL
Select "o'brien-terence",2,'grid.1008.9',1,2,'' UNION ALL
Select "o'brien-terence",4,'grid.416153.4',2,2,'' UNION ALL
Select "o'brien-terence",1,'grid.484852.7',3,2,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",14,'grid.1002.3',1,3,'3,7' UNION ALL
Select "o'brien-terence j",25,'grid.1008.9',2,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.1019.9',3,3,'' UNION ALL
Select "o'brien-terence j",9,'grid.1623.6',4,3,'' UNION ALL
Select "o'brien-terence j",40,'grid.237081.f',5,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.267362.4',6,3,'' UNION ALL
Select "o'brien-terence j",2,'grid.414094.c',7,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.416060.5',8,3,'' UNION ALL
Select "o'brien-terence j",36,'grid.416153.4',9,3,'' UNION ALL
Select "o'brien-terence j",4,'grid.453219.8',10,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.454055.5',11,3,'' UNION ALL
Select "o'brien-terence j",6,'grid.474069.8',12,3,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",13,'grid.481253.9',13,3,'3,4' UNION ALL
Select "o'brien-terence john",1,'grid.1002.3',1,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1008.9',2,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1623.6',3,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.237081.f',4,4,'3,4' UNION ALL
Select "o'brien-terence john",2,'grid.416153.4',5,4,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence",2,'grid.416153.4',1,5,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence j",1,'grid.416153.4',1,6,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terry",1,'grid.137628.9',1,7,'' UNION ALL
Select "o'brien-terry",2,'grid.237081.f',2,7,'3,7' UNION ALL
Select "o'brien-terry",1,'grid.267362.4',3,7,'' UNION ALL
Select "o'brien-timothy",1,'grid.496867.2',1,8,'' UNION ALL
Select "o'brien-timothy",3,'grid.6142.1',2,8,''
)
 select distinct a.lname, a.grid_id
 from t1 a, t1 b
 where a.lname <> b.lname
 and a.grid_id = b.grid_id
)
 select  distinct  lname,
 grid_id ,
 DENSE_RANK() OVER
                   (
                   --PARTITION BY a.lname_init1
                   ORDER BY grid_id
                   )  seq_num,
 from t2
)
select
'matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num = (select max(seq_num )x from t3)
------------------------------------------
union all
--intersect distinct
------------------------------------------
select
'not_matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num != (select max(seq_num )x from t3);

My result. I could not figure out how to merge o'brien-terry to the matched group. It also missed o'brien-timothy

is_matchedlnamegrid_idseq_num
not_matchedo'brien-terence jgrid.1002.31
not_matchedo'brien-terence johngrid.1002.31
not_matchedo'brien-terencegrid.1008.92
not_matchedo'brien-terence jgrid.1008.92
not_matchedo'brien-terence johngrid.1008.92
not_matchedo'brien-terence jgrid.1623.63
not_matchedo'brien-terence johngrid.1623.63
not_matchedo'brien-terence jgrid.237081.f4
not_matchedo'brien-terence johngrid.237081.f4
not_matchedo'brien-terrygrid.237081.f4
not_matchedo'brien-terence jgrid.267362.45
not_matchedo'brien-terrygrid.267362.45
matchedo'brien-t jgrid.416153.46
matchedo'brien-terencegrid.416153.46
matchedo'brien-terence jgrid.416153.46
matchedo'brien-terence johngrid.416153.46
matchedo'brien-terrencegrid.416153.46
matchedo'brien-terrence jgrid.416153.46

Samuel result.

lname_forename_longgrid_id_ctgrid_idname_seqgroup_seqlinkslink_countsis_intersect_links
obrien-t j1grid.1002.3111,3,431,3,4
obrien-terence2grid.1008.9122,3,432,3,4
obrien-terence j14grid.1002.3131,3,431,3,4
obrien-terence john1grid.1002.3141,3,431,3,4
obrien-terry2grid.237081.f273,4,73
obrien-timothy1grid.496867.21881

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

盗琴音 2025-01-24 00:08:20

请考虑以下方法

with temp as (
  select *, array_length(split(links)) link_counts, 
    array_length(split(links)) < max(array_length(split(links))) over() merge_candidate
  from (
    select *, if(count(*) over win > 1, string_agg('' || group_seq) over win, '') links
    from t1
    window win as (partition by grid_id)
  )
  qualify 1 = row_number() over(partition by group_seq order by array_length(split(links)) desc, grid_id_ct desc)
)
select lname_forename_long, grid_id, merge_candidate as is_merged  
from temp where not merge_candidate
union all
select lname_forename_long, ifnull(merged_grid_id, grid_id), if(merged_grid_id is null, false, true) 
from (
  select any_value(t1).*, 
    any_value(( select t2.grid_id
      from unnest(split(t1.links)) link
      join unnest(split(t2.links)) link
      using(link)
      limit 1
    )) as merged_grid_id
  from (select * from temp where merge_candidate) t1
  cross join (select * from temp where not merge_candidate) t2
  group by to_json_string(t1)
)
order by grid_id desc, lname_forename_long desc    

如果应用于问题中的示例数据,

- 输出为在此处输入图像描述

Consider below approach

with temp as (
  select *, array_length(split(links)) link_counts, 
    array_length(split(links)) < max(array_length(split(links))) over() merge_candidate
  from (
    select *, if(count(*) over win > 1, string_agg('' || group_seq) over win, '') links
    from t1
    window win as (partition by grid_id)
  )
  qualify 1 = row_number() over(partition by group_seq order by array_length(split(links)) desc, grid_id_ct desc)
)
select lname_forename_long, grid_id, merge_candidate as is_merged  
from temp where not merge_candidate
union all
select lname_forename_long, ifnull(merged_grid_id, grid_id), if(merged_grid_id is null, false, true) 
from (
  select any_value(t1).*, 
    any_value(( select t2.grid_id
      from unnest(split(t1.links)) link
      join unnest(split(t2.links)) link
      using(link)
      limit 1
    )) as merged_grid_id
  from (select * from temp where merge_candidate) t1
  cross join (select * from temp where not merge_candidate) t2
  group by to_json_string(t1)
)
order by grid_id desc, lname_forename_long desc    

if applied to sample data in your question - output is

enter image description here

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文