Compute sum and cumul (by date) by age group (but people change age group as time passes)

Posted on 2025-02-13 23:07:56 · 7355 words · 0 views · 0 comments

DBMS: PostgreSQL

My problem:
In my database I have a person table with an id and a birth date, an events table that links a person, an event (event_id) and a date, and an age table used for grouping ages. In the real database the person table has about 40 million rows, and events is about 3 times bigger.

I need to produce a report (sum and cumulative sum, "cumul", of X events) by age group (age_group) and date (event_date). Counting the number of events by date is not a problem. The problem lies with the cumul: unlike other variables (sex, for example), a person grows older and changes age group as time passes, so for a given age group the cumul can increase and then decrease. I want the cumul of events, on every date in my report, to use the age of each person on that date.
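To make the moving age concrete, here is a minimal standalone sketch. It uses the same `date_part('year', age(...))` expression as the code further down, and it assumes a single birth date of 2016-07-01 (person B in the example data) and the half-yearly report dates of the example. It is only an illustration, not part of the report query.

```sql
-- Age of one person on each report date (assumed birth date: 2016-07-01).
-- The age, and therefore the age group, depends on the report date.
SELECT d::date AS report_date,
       date_part('year', age(d, TIMESTAMP '2016-07-01'))::int AS age_on_date
FROM generate_series(TIMESTAMP '2020-07-01',
                     TIMESTAMP '2022-07-01',
                     INTERVAL '6 months') AS d
ORDER BY report_date;
```

The ages returned are 4, 4, 5, 5, 6, so this person is in [0-4] on the first two dates and in [5-9] afterwards.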

Example of my inputs and desired output

The only way I have found is to do a Cartesian product of the table person and the dates in v_dates, which makes it easy to follow an event and have it change age_group. The code below uses this method.
BUT I can't use a Cartesian product on my real data (it produces a table that is far too big), so I need another method.

Reproducible example
In this simplified example I want to produce a report every 6 months from 2020-07-01 to 2022-07-01 (view v_dates). In reality I need to produce the same report by day, but the logic remains the same.

My inputs

/* create table person*/
DROP TABLE IF EXISTS person;
CREATE TABLE person 
(
    person_id varchar(1), 
    person_birth_date date 
);
INSERT INTO person 
VALUES ('A', '2017-01-01'), 
       ('B', '2016-07-01');
person_id | person_birth_date
----------+------------------
A         | 2017-01-01
B         | 2016-07-01
/* create table events*/
DROP TABLE IF EXISTS events;
CREATE TABLE events 
(
    person_id   varchar(1), 
    event_id integer, 
    event_date date
);
INSERT INTO events 
VALUES ('A', 1, '2020-07-01'), 
       ('A', 2, '2021-07-01'), 
       ('B', 1, '2021-01-01'), 
       ('B', 2, '2022-01-01');
person_id | event_id | event_date
----------+----------+-----------
A         | 1        | 2020-07-01
A         | 2        | 2021-07-01
B         | 1        | 2021-01-01
B         | 2        | 2022-01-01
/* create table age*/
DROP TABLE IF EXISTS age;
CREATE TABLE age 
(
    age integer, 
    age_group varchar(8)
);
INSERT INTO age 
VALUES (0,'[0-4]'), 
       (1,'[0-4]'), 
       (2,'[0-4]'), 
       (3,'[0-4]'), 
       (4,'[0-4]'), 
       (5,'[5-9]'), 
       (6,'[5-9]'), 
       (7,'[5-9]'), 
       (8,'[5-9]'), 
       (9,'[5-9]');

/* create view dates : contains one date every 6 months from 2020-07-01 to 2022-07-01*/
CREATE or replace view v_dates AS 
    SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '6 month')::date as event_date;
Table age (excerpt):
age | age_group
----+----------
0   | [0-4]
1   | [0-4]
5   | [5-9]

My current method using a Cartesian product

  1. CROSS JOIN person with v_dates, then use a LEFT JOIN to get the event information from table events and a LEFT JOIN to get age_group from table age
CREATE or replace view v_person_event AS
  SELECT 
      pdev.person_id, 
      pdev.event_date, 
      pdev.age, 
      ag.age_group, 
      pdev.event1, 
      pdev.event2
  FROM 
  (
      SELECT  pd.person_id,
              pd.event_date, 
              date_part('year', age(pd.event_date::TIMESTAMP, pd.person_birth_date::TIMESTAMP)) as age,
              CASE WHEN ev.event_id = 1 THEN 1 else 0 END as event1,
              CASE WHEN ev.event_id = 2 THEN 1 else 0 END as event2
      FROM 
      (
            SELECT * 
            FROM person 
            CROSS JOIN v_dates
      ) pd
      LEFT JOIN events ev 
      on pd.person_id = ev.person_id 
      and pd.event_date = ev.event_date
  ) pdev
  Left JOIN age as ag on pdev.age = ag.age
  ORDER by pdev.person_id, pdev.event_date;
  2. Add the columns event1_cum and event2_cum
CREATE or replace view v_person_event_cum AS
  SELECT *,
      SUM(event1) OVER (PARTITION BY person_id ORDER BY event_date) event1_cum,
      SUM(event2) OVER (PARTITION BY person_id ORDER BY event_date) event2_cum
  FROM v_person_event;
SELECT * FROM v_person_event_cum;       
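person_id | event_date | age | age_group | event1 | event2 | event1_cum | event2_cum
----------+------------+-----+-----------+--------+--------+------------+-----------
A         | 2020-07-01 | 3   | [0-4]     | 1      | 0      | 1          | 0
A         | 2021-01-01 | 4   | [0-4]     | 0      | 0      | 1          | 0
A         | 2021-07-01 | 4   | [0-4]     | 0      | 1      | 1          | 1
A         | 2022-01-01 | 5   | [5-9]     | 0      | 0      | 1          | 1
A         | 2022-07-01 | 5   | [5-9]     | 0      | 0      | 1          | 1
B         | 2020-07-01 | 4   | [0-4]     | 0      | 0      | 0          | 0
B         | 2021-01-01 | 4   | [0-4]     | 1      | 0      | 1          | 0
B         | 2021-07-01 | 5   | [5-9]     | 0      | 0      | 1          | 0
B         | 2022-01-01 | 5   | [5-9]     | 0      | 1      | 1          | 1
B         | 2022-07-01 | 6   | [5-9]     | 0      | 0      | 1          | 1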
  3. Desired output: create a report grouped by the variables age_group and event_date
SELECT
    age_group,
    event_date, 
    SUM(event1) as event1, 
    SUM(event2) as event2, 
    SUM(event1_cum) as event1_cum, 
    SUM(event2_cum) as event2_cum
FROM v_person_event_cum 
GROUP BY age_group, event_date
ORDER BY age_group, event_date;
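age_group | event_date | event1 | event2 | event1_cum | event2_cum
----------+------------+--------+--------+------------+-----------
[0-4]     | 2020-07-01 | 1      | 0      | 1          | 0
[0-4]     | 2021-01-01 | 1      | 0      | 2          | 0
[0-4]     | 2021-07-01 | 0      | 1      | 1          | 1
[5-9]     | 2021-07-01 | 0      | 0      | 1          | 0
[5-9]     | 2022-01-01 | 0      | 1      | 2          | 2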

This is why this is not an ordinary cumul: for the age_group [0-4], event1_cum goes from 2 at '2021-01-01' to 1 at '2021-07-01', because B was in [0-4] when event 1 occurred (on '2021-01-01') but is in [5-9] at '2021-07-01', while A is still in [0-4] on both dates.

When we read the report:

  • On 2021-01-01, there were 2 persons aged 0 to 4 (on that date) who had had event1 and 0 persons who had had event2.
  • On 2021-07-01, there was 1 person aged 0 to 4 who had had event1 and 1 person who had had event2.
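The reading above can be checked for one report date at a time without building the person × dates table. The sketch below assumes the person, events and age tables from the example and hard-codes the report date 2021-07-01. It only illustrates the definition of the cumul for a single date and is not meant as the final report query.

```sql
-- Cumul per age group on a single report date (here 2021-07-01).
-- Each event up to that date is counted in the age group its person
-- belongs to ON THE REPORT DATE, not on the date of the event.
SELECT ag.age_group,
       count(*) FILTER (WHERE e.event_id = 1) AS event1_cum,
       count(*) FILTER (WHERE e.event_id = 2) AS event2_cum
FROM events e
JOIN person p USING (person_id)
JOIN age ag
  ON ag.age = date_part('year',
                        age(TIMESTAMP '2021-07-01',
                            p.person_birth_date::timestamp))::int
WHERE e.event_date <= DATE '2021-07-01'
GROUP BY ag.age_group
ORDER BY ag.age_group;
```

On the example data this gives event1_cum = 1 and event2_cum = 1 for [0-4], and event1_cum = 1 and event2_cum = 0 for [5-9], which matches the two report rows for 2021-07-01.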

I can't find a solution to this problem without using a Cartesian product...
Thanks in advance!
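One possible direction, offered only as a sketch and not as a tested answer for the real 40-million-row data: instead of one row per person and date, generate one "+1" row when an event starts counting for an age group and one "-1" row when the person leaves that group, then take a running sum per group. The CTE names (grp, span, delta, changes) and the columns date_in, date_out and d are made up for this illustration. It assumes the person, events and age tables from the example, and it assumes that every birthday falls on a report date (true for a daily report and for this example, otherwise date_in and date_out would have to be snapped to the report grid).

```sql
WITH grp AS (       -- age bounds of each group, derived from table age
    SELECT age_group, min(age) AS lo, max(age) AS hi
    FROM age
    GROUP BY age_group
),
span AS (           -- period during which an event counts for a group
    SELECT g.age_group,
           e.event_id,
           GREATEST(e.event_date,
                    (p.person_birth_date
                     + make_interval(years => g.lo))::date) AS date_in,
           (p.person_birth_date
            + make_interval(years => g.hi + 1))::date       AS date_out
    FROM events e
    JOIN person p USING (person_id)
    JOIN grp g      -- keep only groups the person has not left before the event
      ON e.event_date < p.person_birth_date
                        + make_interval(years => g.hi + 1)
),
delta AS (          -- +1 when the event starts counting, -1 when it stops
    SELECT age_group, event_id, date_in  AS event_date,  1 AS d FROM span
    UNION ALL
    SELECT age_group, event_id, date_out AS event_date, -1 AS d FROM span
),
changes AS (        -- net change per group and date
    SELECT age_group, event_date,
           COALESCE(SUM(d) FILTER (WHERE event_id = 1), 0) AS d1,
           COALESCE(SUM(d) FILTER (WHERE event_id = 2), 0) AS d2
    FROM delta
    GROUP BY age_group, event_date
)
SELECT age_group, event_date,
       SUM(d1) OVER w AS event1_cum,
       SUM(d2) OVER w AS event2_cum
FROM changes
WINDOW w AS (PARTITION BY age_group ORDER BY event_date)
ORDER BY age_group, event_date;
```

On the example data this reproduces the event1_cum and event2_cum values of the desired report for the dates shown there (for instance 1, 2, 1 for event1_cum in [0-4]). It returns rows only for dates on which something changes, including future dates on which people leave a group, so a complete report would still need a join of age groups × v_dates (a small table) with the last known value carried forward. Birth dates on 29 February may differ by one day from what age() returns and would need checking.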
