GROUP BY 和聚合连续数值

发布于 2024-12-13 21:30:32 字数 971 浏览 2 评论 0原文

使用 PostgreSQL 9.0。

假设我有一个包含字段的表：company、profession 和 year。我想返回一个包含独特公司和职业的结果，但根据数字序列聚合（到一个数组中即可）年份：

示例表：

+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+-----------------------------+

我对一个查询感兴趣，该查询将输出类似于以下内容的行：

+-----------------------------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+-----------------------------------------+

基本功能是只有连续年份才可以分组在一起。

原文

Using PostgreSQL 9.0.

Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:

Example Table:

+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+-----------------------------+

I'm interested in a query which would output rows similar to the following:

+-----------------------------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+-----------------------------------------+

The essential feature is that only consecutive years shall be grouped together.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

轻拂→两袖风尘 2024-12-20 21:30:33

@a_horse_with_no_name 的答案很有价值一个正确的解决方案，就像我在评论中已经说过的那样，作为学习如何在 PostgreSQL 中使用不同类型的窗口函数的好材料。

然而，我不禁感到，对于像这样的问题，该答案中所采取的方法有点太过了。基本上，在继续在数组中聚合年份之前，您需要的是一个附加的分组标准。您已经有了公司和职业，现在您只需要一些东西来区分属于不同序列的年份。

这正是上述答案所提供的，这正是我认为可以用更简单的方式完成的。方法如下：

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID

回复收藏 0 原文

撞了怀 2024-12-20 21:30:33

使用 PL/pgSQL 的程序解决方案

对于具有聚合/Windows 函数的普通 SQL 来说，这个问题相当难以处理。虽然循环通常比使用纯 SQL 的基于集合的解决方案慢，但使用 PL/pgSQL 的过程解决方案可以通过对表进行单次顺序扫描（FOR 循环），并且在这种特殊情况下应该明显更快：

测试表：

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
  ('Google',  'Programmer', 2000)
, ('Google',  'Sales',      2000)
, ('Google',  'Sales',      2001)
, ('Google',  'Sales',      2002)
, ('Google',  'Sales',      2004)
, ('Mozilla', 'Sales',      2002)
;

函数：

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

         RETURN QUERY
         SELECT r0.company, r0.profession, years; -- output row

         years := ARRAY[r.year];     -- start new array
      ELSE
         years := years || r.year;   -- add to array - year can be NULL, too
      END IF;

      r0 := r;                       -- remember last row
   END LOOP;

   RETURN QUERY                      -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$;

调用：

SELECT * FROM f_periods();

db>>fiddle 此处

生成请求的结果。

Procedural solution with PL/pgSQL

The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:

Test table:

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
  ('Google',  'Programmer', 2000)
, ('Google',  'Sales',      2000)
, ('Google',  'Sales',      2001)
, ('Google',  'Sales',      2002)
, ('Google',  'Sales',      2004)
, ('Mozilla', 'Sales',      2002)
;

Function:

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

         RETURN QUERY
         SELECT r0.company, r0.profession, years; -- output row

         years := ARRAY[r.year];     -- start new array
      ELSE
         years := years || r.year;   -- add to array - year can be NULL, too
      END IF;

      r0 := r;                       -- remember last row
   END LOOP;

   RETURN QUERY                      -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$;

Call:

SELECT * FROM f_periods();

db<>fiddle here

Produces the requested result.

回复收藏 0 原文

感性不性感 2024-12-20 21:30:32

识别非连续值总是有点棘手，并且涉及多个嵌套子查询（至少我无法想出更好的解决方案）。

第一步是识别年份的非连续值：

步骤 1) 识别非连续值

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

这将返回以下结果：

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

现在，使用 group_cnt 值，我们可以为具有连续年份的每个组创建“组 ID”：

步骤 2 ) 定义组 ID

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

这将返回以下结果：

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

如您所见，每个“组”都有自己的 group_nr，我们最终可以通过添加另一个派生表来使用它进行聚合：

步骤 3) 最终查询

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

以下结果：

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

这将返回正是如果我没记错的话，你想要什么。

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).

The first step is to identify non-consecutive values for the year:

Step 1) Identify non-consecutive values

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

This returns the following result:

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:

Step 2) Define group IDs

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

This returns the following result:

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:

Step 3) Final query

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

This returns the following result:

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

Which is exactly what you wanted, if I'm not mistaken.

回复收藏 0 原文

~没有更多了~