How to sort values within an array, aggregate them, and efficiently count similar records in SQL

Posted on 2025-02-11 21:53:45


I need to count and aggregate rows by property type from a table.

These are some example values from the "Title" column:

Row	Title
1	WATLINGTON STATION - 9 HOUSES AND INDUSTRIAL UNITS
2	STATION ROAD YATE - 1 BUNGALOW & 2 HOUSES
3	THE OLD VICARAGE CARR LANE - HOUSE & STABLE BLOCK

To clean these rows, I used the REPLACE(), REGEXP_REPLACE() and REGEXP_SUBSTR() functions:

SELECT SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(TRIM(REGEXP_REPLACE(REGEXP_SUBSTR(REPLACE("Title",'-2',''), '[^-]*$'), '[0-9]+')),'/', '&'), ' &  ','&'),' AND ','&'),' & ','&'),'& ','&'),'.',''),'&') AS "Category",
COUNT("Category") AS "Frequency"
FROM DATA.PROPERTIES
GROUP BY "Category" 
ORDER BY "Frequency" DESC;

And I got rows like:

Category	Frequency
"HOUSES"	12432
"FLATS"	76736
"HOUSE"	23865
"APARTMENTS"	18814
"HOUSES", "FLATS"	3245
"FLATS", "HOUSES"	1342
"FLATS", "RETAIL UNIT"	362
"FLATS", "HOUSE"	461
"FLATS", "RETAIL"	890
"FLATS", "RETAIL UNITS"	287

I've shown only some of the results. As you can see, [ "HOUSES", "FLATS" ] and [ "FLATS", "HOUSES" ] should be merged into one row because they contain the same values; the result should be [ "HOUSES", "FLATS" ] with a frequency of 4587.

So I assume I need to sort every array in ascending or descending order and then aggregate.
I have a similar question about the three rows [ "FLATS", "RETAIL UNIT" ], [ "FLATS", "RETAIL" ] and [ "FLATS", "RETAIL UNITS" ].

Do you have any ideas on how to resolve this?

Maybe ARRAY_AGG or something else would work. I have no ideas right now, but I would like to solve this.


北方。的韩爷 2025-02-18 21:53:45

Here is an example that reorders and then aggregates.
The idea is to first flatten each array and then put its elements back in sorted order while re-aggregating. Once every array is sorted, a simple GROUP BY puts arrays that were previously like ['a','b'] and ['b','a'] into one group.

with cte (category, frequency) as (
    select array_construct('HOUSES'),12432 union all
    select array_construct('FLATS'),76736 union all
    select array_construct('HOUSE'),23865 union all
    select array_construct('APARTMENTS'),18814 union all
    select array_construct('HOUSES', 'FLATS'),3245 union all
    select array_construct('FLATS', 'HOUSES'),1342 union all
    select array_construct('FLATS', 'RETAIL UNIT'),362 union all
    select array_construct('FLATS', 'HOUSE'),461 union all
    select array_construct('FLATS', 'RETAIL'),890 union all
    select array_construct('FLATS', 'RETAIL UNITS'),287
), cte_1 as
    (select seq,value,c1.frequency
        from cte c1,
        lateral flatten(input=>category)
        order by seq, value
    ), cte_2 as
        (select array_agg(value) category, frequency
            from cte_1
            group by seq,frequency)
    select category, sum(frequency)
        from cte_2
        group by category;

CATEGORY	SUM(FREQUENCY)
[ "FLATS", "HOUSE" ]	461
[ "FLATS", "RETAIL" ]	890
[ "HOUSE" ]	23865
[ "FLATS" ]	76736
[ "FLATS", "RETAIL UNITS" ]	287
[ "APARTMENTS" ]	18814
[ "FLATS", "RETAIL UNIT" ]	362
[ "HOUSES" ]	12432
[ "FLATS", "HOUSES" ]	4587
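
For readers who want to check the logic outside Snowflake, here is a minimal Python sketch of the same idea: sort the elements of each array, then sum the frequencies per sorted array. It is only an illustration, not part of the query; the function name `aggregate_sorted` is made up for this example, and the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample (category, frequency) rows copied from the question.
rows = [
    (["HOUSES"], 12432),
    (["FLATS"], 76736),
    (["HOUSE"], 23865),
    (["APARTMENTS"], 18814),
    (["HOUSES", "FLATS"], 3245),
    (["FLATS", "HOUSES"], 1342),
    (["FLATS", "RETAIL UNIT"], 362),
    (["FLATS", "HOUSE"], 461),
    (["FLATS", "RETAIL"], 890),
    (["FLATS", "RETAIL UNITS"], 287),
]


def aggregate_sorted(rows):
    """Sum frequencies per category after sorting each array's elements."""
    totals = defaultdict(int)
    for category, frequency in rows:
        # Sorting gives a canonical order, so ['a', 'b'] and ['b', 'a']
        # end up under the same key.
        key = tuple(sorted(category))
        totals[key] += frequency
    return dict(totals)


totals = aggregate_sorted(rows)
for key in sorted(totals):
    print(list(key), totals[key])
```

With the sample rows, the two HOUSES/FLATS arrays collapse into a single entry with 3245 + 1342 = 4587, which matches the last row of the result.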


太阳公公是暖光 2025-02-18 21:53:45

I really like Panka's answer, but the ORDER BY should not be in a CTE step; it should be inside the ARRAY_AGG, because a) there is syntax for exactly this (WITHIN GROUP), and b) the order of rows in a CTE is not something to depend on, and relying on row order is magic that will burn you.

Also, the two CTEs (cte_1 and cte_2) can be merged into one step, and if you alias the FLATTEN you can see what is happening:

with fake_data_table (category, frequency) as (
    select split(column1, '|'), column2 
    from values 
        ('HOUSES', 12432),
        ('FLATS', 76736),
        ('HOUSE', 23865),
        ('APARTMENTS', 18814),
        ('HOUSES|FLATS', 3245),
        ('FLATS|HOUSES', 1342),
        ('FLATS|RETAIL UNIT', 362),
        ('FLATS|HOUSE', 461),
        ('FLATS|RETAIL', 890),
        ('FLATS|RETAIL UNITS', 287)
), part_a as (
    select array_agg(f.value) within group (order by f.seq, f.value) as category, 
        d.frequency
    from fake_data_table as d,
    lateral flatten(input=>category) f
    group by f.seq, d.frequency
)
select category, sum(frequency)
from part_a
group by category 
order by 1;

gives:

CATEGORY	SUM(FREQUENCY)
[ "APARTMENTS" ]	18,814
[ "FLATS" ]	76,736
[ "FLATS", "HOUSE" ]	461
[ "FLATS", "HOUSES" ]	4,587
[ "FLATS", "RETAIL" ]	890
[ "FLATS", "RETAIL UNIT" ]	362
[ "FLATS", "RETAIL UNITS" ]	287
[ "HOUSE" ]	23,865
[ "HOUSES" ]	12,432
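
Neither query merges [ "FLATS", "RETAIL UNIT" ], [ "FLATS", "RETAIL" ] and [ "FLATS", "RETAIL UNITS" ], which was the second part of the question. One possible approach, sketched here in Python rather than SQL, is to map each label variant to a canonical label before sorting and aggregating. The mapping below is an assumption made up for illustration; the real list of variants would have to come from the data, and the names `CANONICAL`, `normalize` and `aggregate_normalized` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping of label variants to one canonical label.
# These entries are assumptions for illustration only.
CANONICAL = {
    "RETAIL": "RETAIL UNITS",
    "RETAIL UNIT": "RETAIL UNITS",
}

# A subset of the sample rows from the question.
sample_rows = [
    (["HOUSES", "FLATS"], 3245),
    (["FLATS", "HOUSES"], 1342),
    (["FLATS", "RETAIL UNIT"], 362),
    (["FLATS", "RETAIL"], 890),
    (["FLATS", "RETAIL UNITS"], 287),
]


def normalize(category):
    """Map each label to its canonical form, drop duplicates, and sort."""
    return tuple(sorted({CANONICAL.get(label, label) for label in category}))


def aggregate_normalized(rows):
    """Sum frequencies per normalized category."""
    totals = defaultdict(int)
    for category, frequency in rows:
        totals[normalize(category)] += frequency
    return dict(totals)


merged = aggregate_normalized(sample_rows)
for key in sorted(merged):
    print(list(key), merged[key])
```

Under this assumed mapping the three retail rows collapse into one entry with 362 + 890 + 287 = 1539. In SQL the same effect could come from a lookup table joined to the flattened values before re-aggregating, but whether "RETAIL" really means the same thing as "RETAIL UNITS" is a decision about the data, not something the query can infer.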