Flag non-contiguous date ranges
Background (Input)
The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
Problem (Output)
Although it is possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system I want to flag every date in a month that does not have a contiguous range of dates filling the whole month:
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
Some measurements were taken in the year 1843.
Question
For all weather stations, how would you mark all the days in months that are missing one or more days?
Source Code
The code to select the data resembles:
select
m.id,
m.taken,
m.station_id,
m.amount
from
climate.measurement m
Related Ideas
Generate a table filled with contiguous dates and compare them to the measured data dates.
- What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
- How do I group on continuous ranges
- http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx
Update
The problem can be recreated using the SQL in this section.
Table
The table is created as follows:
CREATE TABLE climate.calendar
(
id serial NOT NULL,
n character varying(2) NOT NULL,
d date NOT NULL,
"valid" boolean NOT NULL DEFAULT true,
CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Generate Data
The following SQL inserts data into the table (id [int], n for name [varchar], d for date [date], valid [boolean]):
insert into climate.calendar (n, d)
select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
The values 'A' through 'F' represent the names of weather stations that made a measurement on a particular day.
Remove Random Rows
Delete some rows as follows:
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
Attempt #1
The following does not toggle the valid flag to false for all the days in a month where the month is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
SELECT DISTINCT date_trunc('month', d)
FROM climate.calendar A
WHERE NOT EXISTS (
SELECT 1
FROM climate.calendar B
WHERE A.d - 1 = B.d
)
);
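A likely reason Attempt #1 misbehaves: neither the subquery nor the outer WHERE is correlated on the station column n, so a day counts as having a predecessor if any station measured on the previous day, and a month selected for one station is flagged for every station. The previous-day test also attributes a gap at the end of a month to the following month. The sketch below keeps the NOT EXISTS idea but checks every calendar day of the row's month for the row's own station. It is an illustration only; it assumes the test table from this section and will probably be slow without an index on (n, d).

```sql
-- Sketch only: flag every row of a (station, month) pair in which at least
-- one calendar day of that month has no row for that station.
UPDATE climate.calendar c
SET valid = false
WHERE EXISTS (
    SELECT 1
    -- One timestamp per calendar day of the month that contains c.d.
    FROM generate_series(
             date_trunc('month', c.d::timestamp),
             date_trunc('month', c.d::timestamp)
                 + interval '1 month' - interval '1 day',
             interval '1 day'
         ) AS g(day_ts)
    WHERE NOT EXISTS (
        SELECT 1
        FROM climate.calendar b
        WHERE b.n = c.n              -- same station
          AND b.d = g.day_ts::date   -- same calendar day
    )
);
```

Note that the generated data stops on 2011-04-09, so April 2011 would be flagged for every station under this rule.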
Attempt #2
The following SQL produces an empty result set:
with gen_calendar as (
select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
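The empty result is most likely because the join ignores the station: a generated date only comes back when no station at all has a row for it, which is very unlikely after deleting 5000 random rows from six complete series. A per-station variant is sketched below. It assumes PostgreSQL 9.3 or later for LATERAL and bounds the generated dates by each station's first and last measurement; it still produces one row per station per day in that range, so it may remain heavy on the real data.

```sql
-- Sketch only: list the (station, date) pairs that have no measurement row.
WITH station_range AS (
    -- First and last recorded day for each station.
    SELECT n, min(d) AS first_day, max(d) AS last_day
    FROM climate.calendar
    GROUP BY n
)
SELECT sr.n, g.day_ts::date AS missing_day
FROM station_range sr
CROSS JOIN LATERAL generate_series(
    sr.first_day::timestamp,
    sr.last_day::timestamp,
    interval '1 day'
) AS g(day_ts)
LEFT JOIN climate.calendar c
       ON c.n = sr.n
      AND c.d = g.day_ts::date
WHERE c.d IS NULL          -- keep only the days with no matching row
ORDER BY sr.n, missing_day;
```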
Attempt #3
The following SQL generates all the possible combinations of station names and dates:
select
distinct( cc.n ), t.d
from
climate.calendar cc,
(
select (date('1982-01-1') + (n || ' days')::interval)::date d
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
) t
order by
cc.n
However, in the real data there are several hundred stations and the dates go back to the mid-1800s, so the Cartesian product of all dates and all stations is too large. Such an approach might work, given enough time... There must be a faster way.
Attempt #4
PostgreSQL has windowing functions.
How to select specific changes using windowing functions in postgres
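As a rough illustration of how a window function could produce the flagged output without generating any dates, the query below counts the rows in each station's month and compares that count with the length of the month. It assumes at most one row per station per day; the window name station_month and the output column flag are made up for this sketch.

```sql
-- Sketch only: emit 'D' for every row that belongs to an incomplete month.
SELECT
    CASE
        WHEN count(*) OVER station_month
             < extract(day FROM date_trunc('month', d::timestamp)
                                + interval '1 month' - interval '1 day')
        THEN 'D'
        ELSE ''
    END AS flag,
    n,
    d
FROM climate.calendar
-- All rows of one station within one calendar month share a partition.
WINDOW station_month AS (PARTITION BY n, date_trunc('month', d::timestamp))
ORDER BY n, d;
```

Because each row is compared only with the other rows in its own partition, nothing close to the Cartesian product from Attempt #3 is built.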
Thank you!
Comments (3)
generate_series()
PostgreSQL's generate_series() function can create a view that contains a consecutive list of dates:
The expression select max(date) - min(date) from test might be off by one.
Count Days Per Month
One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is translated to climate_calendar.) The second returns the actual daily readings each station produced per month.
Maximum Days Per Month Per Station
This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)
Actual Days Per Month Per Station
The total number of days returned will be fewer than the tallies. (For example, January will always have 31 days or fewer.)
Drop the ORDER BY clauses in production (they're helpful in development).
Compare Views
Join the two views to identify the stations and months that need to be flagged, into a new view:
The column num_days_missing is not necessary, but it is useful.
These are the rows that need to be updated:
Update Database
To update them, the id key is convenient.
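The steps of this answer can be pulled together roughly as follows. This is a sketch under assumptions, not the answer's original code: the view names expected_days, actual_days and incomplete_months are hypothetical, the views read the question's climate.calendar table rather than climate_calendar, and the expected day count is derived from the month itself instead of a generate_series() view.

```sql
-- Maximum days per month per station: the length of every month in which
-- the station has at least one row.
CREATE VIEW climate.expected_days AS
SELECT DISTINCT
    n,
    date_trunc('month', d::timestamp)::date AS month_start,
    extract(day FROM date_trunc('month', d::timestamp)
                     + interval '1 month' - interval '1 day')::int AS num_days
FROM climate.calendar;

-- Actual days per month per station: the rows really present.
CREATE VIEW climate.actual_days AS
SELECT
    n,
    date_trunc('month', d::timestamp)::date AS month_start,
    count(*)::int AS num_days
FROM climate.calendar
GROUP BY n, date_trunc('month', d::timestamp)::date;

-- Compare views: station months with fewer rows than calendar days.
CREATE VIEW climate.incomplete_months AS
SELECT
    e.n,
    e.month_start,
    e.num_days - a.num_days AS num_days_missing
FROM climate.expected_days e
JOIN climate.actual_days a
  ON a.n = e.n
 AND a.month_start = e.month_start
WHERE a.num_days < e.num_days;

-- Update database: flag the affected rows through their id key.
UPDATE climate.calendar
SET valid = false
WHERE id IN (
    SELECT c.id
    FROM climate.calendar c
    JOIN climate.incomplete_months im
      ON im.n = c.n
     AND im.month_start = date_trunc('month', c.d::timestamp)::date
);
```

A station month with no rows at all never appears in these views, which is harmless here because there is nothing to flag in it.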
Here is one way you could do it, assuming you have a BOOLEAN field called is_contiguous. Modify as necessary:
Edit:
I believe I misunderstood your requirement. I thought you wanted to flag individual dates that were not contiguous. But apparently, you want to flag an entire month's worth of dates as discontiguous if it is missing any number of days.
Edit 2:
Here is a modified version of my original (incorrect) query which selects the distinct months that are missing any days:
Assuming that there can't be more than one row per day, this should return all the months for which the number of rows is not equal to the number of days in the month.
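A minimal sketch of such a query against the question's test table might look like the following. Grouping by station n as well as by month is an assumption added here so that one station's gap does not hide behind another station's rows; the column aliases are illustrative.

```sql
-- Sketch only: station months whose row count differs from the month length.
SELECT
    n,
    date_trunc('month', d::timestamp)::date AS month_start,
    count(*) AS days_present
FROM climate.calendar
GROUP BY n, date_trunc('month', d::timestamp)
HAVING count(*) <> extract(day FROM date_trunc('month', d::timestamp)
                                    + interval '1 month' - interval '1 day')
ORDER BY n, month_start;
```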