Calculate sum and cumul by age group and by date (but people change age group as time passes)

Posted on 2025-02-13 23:07:56

DBMS: PostgreSQL

My problem:
In my database I have a person table with an id and a birth date, an events table that links a person, an event (event_id) and a date, and an age table used for grouping ages. In the real database the person table has about 40 million rows, and events is three times bigger.

I need to produce a report (sum and cumul, i.e. running total, of X events) by age group (age_group) and date (event_date). Counting the number of events by date is not a problem. The problem lies with the cumul: unlike other variables (sex, for example), age changes, so a person grows older and moves to another age group as time passes, and for a given age group the cumul can increase and then decrease. I want the cumul of events, on every date in my report, to use the age of the persons on that date.
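To make the age-group change concrete, here is a minimal sketch that evaluates one person on two report dates. It assumes a birth date of 2016-07-01 (the value inserted for person B in the example) and hard-codes the two sample groups in a CASE expression instead of joining to the age table.

```sql
-- Illustration only: one birth date, two report dates.
-- The CASE expression stands in for the lookup in table age (assumption).
SELECT d::date AS report_date,
       date_part('year', age(d, TIMESTAMP '2016-07-01'))::int AS age_years,
       CASE WHEN date_part('year', age(d, TIMESTAMP '2016-07-01')) < 5
            THEN '[0-4]'
            ELSE '[5-9]'
       END AS age_group
FROM (VALUES (TIMESTAMP '2021-01-01'),
             (TIMESTAMP '2021-07-01')) AS t(d);
```

The same person is 4 years old, in [0-4], on the first date and 5 years old, in [5-9], on the second, so an event recorded before 2021-07-01 has to be counted in a different group from that date on.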

Example of my inputs and desired output

The only way I have found is a Cartesian product of the person table and the dates in v_dates, which makes it easy to follow an event and move it from one age_group to the next. The code below uses this method.
BUT I can't use a Cartesian product on my real data (it makes the table far too big), so I need another method.

Reproducible example
In this simplified example I want to produce a report every six months from 2020-07-01 to 2022-07-01 (view v_dates). In reality I need to produce the same report by day, but the logic remains the same.
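As a rough order of magnitude for the daily version, the sketch below multiplies the stated 40 million persons by the number of days in the example's date range. The real reporting period is not given, so reusing 2020-07-01 to 2022-07-01 here is an assumption.

```sql
-- Back-of-the-envelope size of person x daily dates (assumed period).
SELECT (DATE '2022-07-01' - DATE '2020-07-01') + 1 AS report_days,
       40000000::bigint
         * ((DATE '2022-07-01' - DATE '2020-07-01') + 1) AS person_date_rows;
```

Under that assumption the product has 731 days times 40 million persons, about 29 billion rows, before the events are even joined.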

My inputs

/* create table person*/
DROP TABLE IF EXISTS person;
CREATE TABLE person 
(
    person_id varchar(1), 
    person_birth_date date 
);
INSERT INTO person 
VALUES ('A', '2017-01-01'), 
       ('B', '2016-07-01');

person_id	person_birth_date
A	2017-01-01
B	2016-07-01

/* create table events*/
DROP TABLE IF EXISTS events;
CREATE TABLE events 
(
    person_id   varchar(1), 
    event_id integer, 
    event_date date
);
INSERT INTO events 
VALUES ('A', 1, '2020-07-01'), 
       ('A', 2, '2021-07-01'), 
       ('B', 1, '2021-01-01'), 
       ('B', 2, '2022-01-01');

person_id	event_id	event_date
A	1	2020-07-01
A	2	2021-07-01
B	1	2021-01-01
B	2	2022-01-01

/* create table age*/
DROP TABLE IF EXISTS age;
CREATE TABLE age 
(
    age integer, 
    age_group varchar(8)
);
INSERT INTO age 
VALUES (0,'[0-4]'), 
       (1,'[0-4]'), 
       (2,'[0-4]'), 
       (3,'[0-4]'), 
       (4,'[0-4]'), 
       (5,'[5-9]'), 
       (6,'[5-9]'), 
       (7,'[5-9]'), 
       (8,'[5-9]'), 
       (9,'[5-9]');

age	age_group
0	[0-4]
1	[0-4]
...	...
5	[5-9]

/* create view dates : contains one date every 6 months from 2020-07-01 to 2022-07-01*/
CREATE or replace view v_dates AS 
    SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '6 month')::date as event_date;

My current method, using a Cartesian product

CROSS JOIN of person and v_dates
then a LEFT JOIN to get the event information from table events
then a LEFT JOIN to get age_group from table age

/* one row per person and report date, with the age and event flags at that date */
CREATE OR REPLACE VIEW v_person_event AS
  SELECT 
      pdev.person_id, 
      pdev.event_date, 
      pdev.age, 
      ag.age_group, 
      pdev.event1, 
      pdev.event2
  FROM 
  (
      SELECT  pd.person_id,
              pd.event_date, 
              /* age in completed years on the report date */
              date_part('year', age(pd.event_date::TIMESTAMP, pd.person_birth_date::TIMESTAMP)) AS age,
              CASE WHEN ev.event_id = 1 THEN 1 ELSE 0 END AS event1,
              CASE WHEN ev.event_id = 2 THEN 1 ELSE 0 END AS event2
      FROM 
      (
            /* Cartesian product: every person on every report date */
            SELECT * 
            FROM person 
            CROSS JOIN v_dates
      ) pd
      LEFT JOIN events ev 
      ON pd.person_id = ev.person_id 
      AND pd.event_date = ev.event_date
  ) pdev
  LEFT JOIN age AS ag ON pdev.age = ag.age
  ORDER BY pdev.person_id, pdev.event_date;

Then I add the columns event1_cum and event2_cum (running totals per person):

CREATE or replace view v_person_event_cum AS
  SELECT *,
      SUM(event1) OVER (PARTITION BY person_id ORDER BY event_date) event1_cum,
      SUM(event2) OVER (PARTITION BY person_id ORDER BY event_date) event2_cum
  FROM v_person_event;
SELECT * FROM v_person_event_cum;

person_id	event_date	age	age_group	event1	event2	event1_cum	event2_cum
A	2020-07-01	3	[0-4]	1	0	1	0
A	2021-01-01	4	[0-4]	0	0	1	0
A	2021-07-01	4	[0-4]	0	1	1	1
A	2022-01-01	5	[5-9]	0	0	1	1
A	2022-07-01	5	[5-9]	0	0	1	1
B	2020-07-01	4	[0-4]	0	0	0	0
B	2021-01-01	4	[0-4]	1	0	1	0
B	2021-07-01	5	[5-9]	0	0	1	0
B	2022-01-01	5	[5-9]	0	1	1	1
B	2022-07-01	6	[5-9]	0	0	1	1

Desired output: a report grouped by age_group and event_date

SELECT
    age_group,
    event_date, 
    SUM(event1) as event1, 
    SUM(event2) as event2, 
    SUM(event1_cum) as event1_cum, 
    SUM(event2_cum) as event2_cum
FROM v_person_event_cum 
GROUP BY age_group, event_date
ORDER BY age_group, event_date;

age_group	event_date	event1	event2	event1_cum	event2_cum
[0-4]	2020-07-01	1	0	1	0
[0-4]	2021-01-01	1	0	2	0
[0-4]	2021-07-01	0	1	1	1
[5-9]	2021-07-01	0	0	1	0
[5-9]	2022-01-01	0	1	2	2
[5-9]	2022-07-01	0	0	2	2

This is why this is not an ordinary cumul: for the age_group [0-4], event1_cum goes from 2 at '2021-01-01' to 1 at '2021-07-01', because B was in [0-4] at the time of event 1 ('2021-01-01') but is in [5-9] at '2021-07-01', while A is still in [0-4] on both dates.

When we read the report:

On 2021-01-01, there were 2 persons aged 0 to 4 (on that date) who had had event1 and 0 persons who had had event2.
On 2021-07-01, there was 1 person aged 0 to 4 who had had event1 and 1 person who had had event2.

I can't find a solution to this problem without using a Cartesian product...
Thanks in advance!
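One possible direction, sketched on the sample data only and not tested at the real scale, is to turn each event into dated deltas instead of following every person through every date. An event adds +1 to a group on the date it starts counting there and -1 on the date the person ages out of that group, and a running sum of the deltas gives the cumul at each report date. The names v_cum_by_delta, age_band, spans, stream and running are hypothetical, and age_band is an assumed compact form of the age table (one row per group). With the real tables, the person and events CTEs would be dropped.

```sql
CREATE TEMP VIEW v_cum_by_delta AS
WITH person(person_id, person_birth_date) AS (          -- sample rows from the question
    VALUES ('A'::text, DATE '2017-01-01'), ('B', DATE '2016-07-01')
),
events(person_id, event_id, event_date) AS (            -- sample rows from the question
    VALUES ('A'::text, 1, DATE '2020-07-01'), ('A', 2, DATE '2021-07-01'),
           ('B', 1, DATE '2021-01-01'), ('B', 2, DATE '2022-01-01')
),
-- Assumption: one row per group, covering ages in [age_from, age_to).
age_band(age_group, age_from, age_to) AS (
    VALUES ('[0-4]'::text, 0, 5), ('[5-9]', 5, 10)
),
report_dates(event_date) AS (
    SELECT generate_series(TIMESTAMP '2020-07-01', TIMESTAMP '2022-07-01',
                           INTERVAL '6 month')::date
),
-- For each event and group: the period during which the event counts in that group.
spans AS (
    SELECT b.age_group,
           e.event_id,
           GREATEST(e.event_date,
                    (p.person_birth_date + make_interval(years => b.age_from))::date) AS date_in,
           (p.person_birth_date + make_interval(years => b.age_to))::date AS date_out
    FROM events e
    JOIN person p ON p.person_id = e.person_id
    CROSS JOIN age_band b
),
-- +1 when an event starts counting, -1 when the person leaves the group.
-- Report dates are added with a zero delta so the running sum can be read at them.
stream AS (
    SELECT age_group, date_in AS d, event_id, 1 AS delta, false AS is_report
    FROM spans WHERE date_in < date_out
    UNION ALL
    SELECT age_group, date_out, event_id, -1, false
    FROM spans WHERE date_in < date_out
    UNION ALL
    SELECT b.age_group, r.event_date, NULL::int, 0, true
    FROM age_band b CROSS JOIN report_dates r
),
running AS (
    SELECT age_group, d, is_report,
           SUM(delta) FILTER (WHERE event_id = 1) OVER w AS event1_cum,
           SUM(delta) FILTER (WHERE event_id = 2) OVER w AS event2_cum
    FROM stream
    WINDOW w AS (PARTITION BY age_group ORDER BY d)
)
SELECT age_group,
       d AS event_date,
       COALESCE(event1_cum, 0) AS event1_cum,
       COALESCE(event2_cum, 0) AS event2_cum
FROM running
WHERE is_report
ORDER BY age_group, event_date;

SELECT * FROM v_cum_by_delta;
```

On the sample data this gives the same event1_cum and event2_cum values as the cross-join report, with two differences: it returns only the cumul columns, and it also returns rows of zeros for groups that are empty on a date. The intermediate result grows with events times age groups rather than persons times dates. One caveat to check before relying on it: adding whole years to a 29 February birth date may switch groups a day apart from what age() reports.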
