SQL: Matching date ranges
I'm looking for best practices when dealing with mappings that vary in time and are defined by `from` and `to` date fields. For example, take this sample Postgres DB with employee info.
It has a mapping between employees and salaries that is valid for different date ranges:
> SELECT * FROM salaries LIMIT 5
| emp_no | salary | from_date | to_date |
|---------:|---------:|:------------|:-----------|
| 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 10001 | 66961 | 1990-06-25 | 1991-06-25 |
And a mapping between employees and titles:
> SELECT * FROM titles LIMIT 5
| emp_no | title | from_date | to_date |
|---------:|:----------------|:------------|:-----------|
| 10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
| 10002 | Staff | 1996-08-03 | 9999-01-01 |
| 10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
| 10004 | Engineer | 1986-12-01 | 1995-12-01 |
| 10004 | Senior Engineer | 1995-12-01 | 9999-01-01 |
What's the best way to map titles to salaries so I can, for instance, calculate average salaries per title? Note that an employee's salary may change without a title change and vice versa.
My current solution is to CROSS JOIN these tables with a table containing all possible dates of interest:
WITH dates AS (
SELECT date_trunc('day', dd)::date AS the_date
FROM generate_series
( '1985-01-01'::timestamp
, '2003-01-01'::timestamp
, '1 day'::interval) dd
),
daily_salaries AS (
SELECT
emp_no,
salary,
the_date
FROM salaries t
CROSS JOIN dates d
WHERE the_date BETWEEN from_date AND to_date
),
daily_titles AS (
SELECT
emp_no,
title,
the_date
FROM titles t
CROSS JOIN dates d
WHERE the_date BETWEEN from_date AND to_date
)
SELECT
title,
AVG(salary) AS avg_salary
FROM daily_salaries
INNER JOIN daily_titles USING (emp_no, the_date)
GROUP BY 1
ORDER BY 2 DESC
But this seems terribly inefficient. What's a better way to accomplish this?
Edit: As Luuk mentioned in his answer, the problem with the query above is that the CROSS JOIN will generate a very large number of records. The crux of the question is how to avoid expanding salaries and titles over days for each employee while still associating an employee's salary to only the period in which they held a title.
The general question is how to merge over arbitrary time periods defined by a start and end without dealing with every day in the intervals, if at all possible.
Comments (1)
It is a bit unclear why you are selecting from 2844047 records (`select count(*) from salaries`) and multiplying the number of results by 365 (for every day in the existing year), resulting in a temporary result of 1038077155+ records ("+" because some years have more than 365 days). Finally, you are calculating the `avg_salary`, which will always be the same for all of the 365 days in that year. I do think there is no need to use `generate_series` for all individual dates, and this query should give the correct results:
In the above query I did not select from_date and/or to_date, because that will be left as a TODO for you.
Results of above query:
EDIT:
First you need to get the salary of an employee with the correct starting date and ending date; this deals with title changes partway through the year:
example output (with the WHERE enabled):
When you want to know the average salary on 1996-01-01, you can do:
output:
P.S. There is one small error in this, because the to_date of one row is the same as the from_date of the next row. I did not handle this in this script.
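The queries from this answer are not preserved above, so what follows is not the answerer's code. It is only a minimal sketch of the general idea under discussion: join salary rows to title rows wherever their date ranges overlap, then weight each salary by the number of days in the overlap instead of expanding every range into daily rows. It assumes that `to_date` is exclusive (a half-open range, which also avoids counting the shared boundary day twice) and that the window of interest is 1985-01-01 to 2003-01-01, taken from the `generate_series` bounds in the question. The CTE name `overlap_periods` and the column `days_held` are made up for illustration, and the sample rows from the question are inlined so the query runs on its own.

```sql
WITH salaries (emp_no, salary, from_date, to_date) AS (
    -- Sample rows copied from the question; remove this CTE to use the real table.
    VALUES
        (10001, 60117, DATE '1986-06-26', DATE '1987-06-26'),
        (10001, 62102, DATE '1987-06-26', DATE '1988-06-25'),
        (10001, 66074, DATE '1988-06-25', DATE '1989-06-25'),
        (10001, 66596, DATE '1989-06-25', DATE '1990-06-25'),
        (10001, 66961, DATE '1990-06-25', DATE '1991-06-25')
),
titles (emp_no, title, from_date, to_date) AS (
    -- Sample rows copied from the question; remove this CTE to use the real table.
    VALUES
        (10001, 'Senior Engineer', DATE '1986-06-26', DATE '9999-01-01'),
        (10002, 'Staff',           DATE '1996-08-03', DATE '9999-01-01'),
        (10003, 'Senior Engineer', DATE '1995-12-03', DATE '9999-01-01'),
        (10004, 'Engineer',        DATE '1986-12-01', DATE '1995-12-01'),
        (10004, 'Senior Engineer', DATE '1995-12-01', DATE '9999-01-01')
),
overlap_periods AS (
    SELECT
        t.title,
        s.salary,
        -- Number of days on which both the salary row and the title row apply,
        -- clipped to the assumed window of interest. date - date yields an integer.
        LEAST(s.to_date, t.to_date, DATE '2003-01-01')
          - GREATEST(s.from_date, t.from_date, DATE '1985-01-01') AS days_held
    FROM salaries s
    JOIN titles t
      ON  t.emp_no = s.emp_no
      -- Two half-open ranges overlap when each starts before the other ends.
      AND s.from_date < t.to_date
      AND t.from_date < s.to_date
)
SELECT
    title,
    -- Day-weighted average; the numeric cast avoids integer division and overflow.
    SUM(salary::numeric * days_held) / SUM(days_held) AS avg_salary
FROM overlap_periods
WHERE days_held > 0          -- drop overlaps that fall entirely outside the window
GROUP BY title
ORDER BY avg_salary DESC;
```

With this shape the intermediate result has one row per overlapping salary/title pair rather than one row per employee per day. If `to_date` is in fact meant to be inclusive, the day count and the overlap conditions would need adjusting (for example, adding one day to the upper bound), so treat the boundary handling here as an assumption to verify against the data.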