Sum and cumulative sum by age group (by date) (but people change age group as time passes)
DBMS: PostgreSQL
My problem:
In my database I have a `person` table with an id and a birth date, an `events` table that links a person, an event (`event_id`) and a date, and an `age` table used for grouping ages. In the real database the `person` table has about 40 million rows, and `events` is 3 times bigger.
I need to produce a report (sum and cumulative sum of X events) by age group (`age_group`) and date (`event_date`). Counting the number of events by date is not a problem. The problem lies with the cumulative sum: unlike other variables (sex, for example), a person grows older and changes age group as time passes, so for a given age group the cumulative sum can increase and then decrease. On every date in my report, I want the cumulative event count to use the age of each person on that date.
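To make the moving age group concrete, here is a minimal sketch that computes the age of one person on two report dates. It assumes the birth date 2016-07-01 of person B from the sample data further down; the column aliases `report_date` and `age_on_date` are only illustrative.

```sql
-- Age in whole years of a person born on 2016-07-01, on two report dates.
-- The birth date is taken from the sample data (person B); aliases are illustrative.
SELECT d::date AS report_date,
       date_part('year', age(d, timestamp '2016-07-01'))::int AS age_on_date
FROM generate_series(timestamp '2021-01-01',
                     timestamp '2021-07-01',
                     interval '6 months') AS d;
-- 2021-01-01 -> 4  (still in [0-4])
-- 2021-07-01 -> 5  (now in [5-9])
```

Any event this person had before 2021-07-01 therefore has to be counted under [0-4] up to 2021-01-01 and under [5-9] from 2021-07-01 onward.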
Example of my inputs and desired output
The only way I found is to do a Cartesian product of the `person` table and the dates in `v_dates`, which makes it easy to follow an event and move it from one `age_group` to the next. The code below uses this method.
But I can't use a Cartesian product on my real data (the resulting table is far too big), so I need another method.
Reproducible example
In this simplified example I want to produce a report every 6 months from 2020-07-01 to 2022-07-01 (view `v_dates`). In reality I need to produce the same report by day, but the logic remains the same.
My inputs
/* create table person*/
DROP TABLE IF EXISTS person;
CREATE TABLE person
(
person_id varchar(1),
person_birth_date date
);
INSERT INTO person
VALUES ('A', '2017-01-01'),
('B', '2016-07-01');
person_id | person_birth_date |
---|---|
A | 2017-01-01 |
B | 2016-07-01 |
/* create table events*/
DROP TABLE IF EXISTS events;
CREATE TABLE events
(
person_id varchar(1),
event_id integer,
event_date date
);
INSERT INTO events
VALUES ('A', 1, '2020-07-01'),
('A', 2, '2021-07-01'),
('B', 1, '2021-01-01'),
('B', 2, '2022-01-01');
person_id | event_id | event_date |
---|---|---|
A | 1 | 2020-07-01 |
A | 2 | 2021-07-01 |
B | 1 | 2021-01-01 |
B | 2 | 2022-01-01 |
/* create table age*/
DROP TABLE IF EXISTS age;
CREATE TABLE age
(
age integer,
age_group varchar(8)
);
INSERT INTO age
VALUES (0,'[0-4]'),
(1,'[0-4]'),
(2,'[0-4]'),
(3,'[0-4]'),
(4,'[0-4]'),
(5,'[5-9]'),
(6,'[5-9]'),
(7,'[5-9]'),
(8,'[5-9]'),
(9,'[5-9]');
/* create view v_dates : contains one date every 6 months from 2020-07-01 to 2022-07-01*/
CREATE or replace view v_dates AS
SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '6 month')::date as event_date;
age | age_group |
---|---|
0 | [0-4] |
1 | [0-4] |
5 | [5-9] |
My current method using a Cartesian product
- `CROSS JOIN` of `person` and `v_dates`
- with a `LEFT JOIN` to get info from table `events`
- with a `LEFT JOIN` to get `age_group` from table `age`
/* one row per person and report date, with the person's age and age group on that date */
CREATE OR REPLACE VIEW v_person_event AS
SELECT
    pdev.person_id,
    pdev.event_date,
    pdev.age,
    ag.age_group,
    pdev.event1,
    pdev.event2
FROM
(
    SELECT pd.person_id,
           pd.event_date,
           /* age in whole years on the report date */
           date_part('year', age(pd.event_date::TIMESTAMP, pd.person_birth_date::TIMESTAMP)) AS age,
           /* 1 if the person had the event on exactly this date, else 0 */
           CASE WHEN ev.event_id = 1 THEN 1 ELSE 0 END AS event1,
           CASE WHEN ev.event_id = 2 THEN 1 ELSE 0 END AS event2
    FROM
    (
        /* Cartesian product: every person on every report date */
        SELECT *
        FROM person
        CROSS JOIN v_dates
    ) pd
    LEFT JOIN events ev
           ON pd.person_id = ev.person_id
          AND pd.event_date = ev.event_date
) pdev
LEFT JOIN age AS ag ON pdev.age = ag.age
ORDER BY pdev.person_id, pdev.event_date;
- add columns `event1_cum` and `event2_cum`
CREATE or replace view v_person_event_cum AS
SELECT *,
SUM(event1) OVER (PARTITION BY person_id ORDER BY event_date) event1_cum,
SUM(event2) OVER (PARTITION BY person_id ORDER BY event_date) event2_cum
FROM v_person_event;
SELECT * FROM v_person_event_cum;
person_id | event_date | age | age_group | event1 | event2 | event1_cum | event2_cum |
---|---|---|---|---|---|---|---|
A | 2020-07-01 | 3 | [0-4] | 1 | 0 | 1 | 0 |
A | 2021-01-01 | 4 | [0-4] | 0 | 0 | 1 | 0 |
A | 2021-07-01 | 4 | [0-4] | 0 | 1 | 1 | 1 |
A | 2022-01-01 | 5 | [5-9] | 0 | 0 | 1 | 1 |
A | 2022-07-01 | 5 | [5-9] | 0 | 0 | 1 | 1 |
B | 2020-07-01 | 4 | [0-4] | 0 | 0 | 0 | 0 |
B | 2021-01-01 | 4 | [0-4] | 1 | 0 | 1 | 0 |
B | 2021-07-01 | 5 | [5-9] | 0 | 0 | 1 | 0 |
B | 2022-01-01 | 5 | [5-9] | 0 | 1 | 1 | 1 |
B | 2022-07-01 | 6 | [5-9] | 0 | 0 | 1 | 1 |
- desired output: create a report grouped by the variables `age_group` and `event_date`
SELECT
age_group,
event_date,
SUM(event1) as event1,
SUM(event2) as event2,
SUM(event1_cum) as event1_cum,
SUM(event2_cum) as event2_cum
FROM v_person_event_cum
GROUP BY age_group, event_date
ORDER BY age_group, event_date;
age_group | event_date | event1 | event2 | event1_cum | event2_cum |
---|---|---|---|---|---|
[0-4] | 2020-07-01 | 1 | 0 | 1 | 0 |
[0-4] | 2021-01-01 | 1 | 0 | 2 | 0 |
[0-4] | 2021-07-01 | 0 | 1 | 1 | 1 |
[5-9] | 2021-07-01 | 0 | 0 | 1 | 0 |
[5-9] | 2022-01-01 | 0 | 1 | 2 | 2 |
[5-9] | 2022-07-01 | 0 | 0 | 2 | 2 |
This is why this is not an ordinary cumulative sum: for the `age_group` [0-4], `event1_cum` goes from 2 at '2021-01-01' to 1 at '2021-07-01', because B was in [0-4] at the time of event 1 ('2021-01-01') but had moved to [5-9] by '2021-07-01'.
When we read the report:
- on 2021-01-01, 2 people aged 0 to 4 (at that date) had had event1 and nobody had had event2.
- on 2021-07-01, 1 person aged 0 to 4 had had event1 and 1 person had had event2.
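The drop can be traced back to individual rows with a small check against the `v_person_event_cum` view defined above. This is only a sketch for the sample data: it lists each person's age group and cumulative event1 count on the two dates involved.

```sql
-- Who carries an event1 cumulative count, and in which age group,
-- on the two report dates where the [0-4] total changes from 2 to 1.
SELECT event_date,
       person_id,
       age,
       age_group,
       event1_cum
FROM v_person_event_cum
WHERE event_date IN ('2021-01-01', '2021-07-01')
ORDER BY event_date, person_id;
```

With the sample data, B's row moves from [0-4] to [5-9] between the two dates while A's row stays in [0-4], so the [0-4] total loses one person without any event being undone.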
I can't find a solution to this problem without using a Cartesian product...
Thanks in advance!
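One direction that might avoid the person × date product, offered as an untested sketch and not as part of the original post: turn every event into a +1 on the date it starts counting for an age group and a -1 on the date the person leaves that group, then take a running sum of those deltas per age group. The CTE names (`age_bounds`, `spans`, `deltas`, `per_date`) are made up for illustration. The sketch assumes that no event is dated before the first report date, that every delta date falls on a report date (which holds for a daily calendar, and for the sample data), and that computing birthdays as `person_birth_date + n years` is acceptable (for people born on February 29 this may differ by a day from the `age()` function used above).

```sql
WITH age_bounds AS (
    -- One row per age group with its lowest and highest age in whole years.
    SELECT age_group, MIN(age) AS min_age, MAX(age) AS max_age
    FROM age
    GROUP BY age_group
),
spans AS (
    -- One row per event and per age group the person is in on or after the event:
    -- the event counts for that group from date_in (inclusive) to date_out (exclusive).
    SELECT ev.event_id,
           ab.age_group,
           GREATEST(ev.event_date,
                    (p.person_birth_date + make_interval(years => ab.min_age))::date) AS date_in,
           (p.person_birth_date + make_interval(years => ab.max_age + 1))::date AS date_out
    FROM events ev
    JOIN person p ON p.person_id = ev.person_id
    JOIN age_bounds ab
      ON ev.event_date < p.person_birth_date + make_interval(years => ab.max_age + 1)
),
deltas AS (
    -- +1 when the event starts counting for the group, -1 when the person ages out.
    SELECT age_group, date_in AS delta_date, event_id, 1 AS delta FROM spans
    UNION ALL
    SELECT age_group, date_out, event_id, -1 FROM spans
),
per_date AS (
    -- Net change per age group and date, one column per event type.
    SELECT age_group,
           delta_date,
           SUM(delta) FILTER (WHERE event_id = 1) AS d1,
           SUM(delta) FILTER (WHERE event_id = 2) AS d2
    FROM deltas
    GROUP BY age_group, delta_date
)
SELECT g.age_group,
       d.event_date,
       SUM(COALESCE(x.d1, 0)) OVER w AS event1_cum,
       SUM(COALESCE(x.d2, 0)) OVER w AS event2_cum
FROM v_dates d
CROSS JOIN age_bounds g              -- report dates x age groups only, not persons
LEFT JOIN per_date x
       ON x.age_group = g.age_group
      AND x.delta_date = d.event_date
WINDOW w AS (PARTITION BY g.age_group ORDER BY d.event_date)
ORDER BY g.age_group, d.event_date;
```

Traced by hand on the sample data, this gives the same `event1_cum` and `event2_cum` values as the report above, plus rows with 0 for dates where nobody is in the age group. The intermediate `spans` set grows with the number of events times the number of age groups a person can still reach, not with the number of report dates; whether that is fast enough on 40 million persons would need to be measured.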