Flagging non-contiguous date ranges

Posted on 2024-11-05 08:43:52

Background (Input)

The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:

"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14  -- Start of March
"2008-03-02";17
"2008-03-05";17

Problem (Output)

Although it is possible to extrapolate the missing data (e.g., by averaging from other years) to provide contiguous ranges, I want to keep the system simple and instead flag every day that belongs to a month without a contiguous range of dates covering the whole month:

D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14  -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17

Some measurements were taken in the year 1843.

Question

For all weather stations, how would you mark all the days in months that are missing one or more days?

Source Code

The code to select the data resembles:

select
  m.id,
  m.taken,
  m.station_id,
  m.amount
from
  climate.measurement m;

Related Ideas

Generate a table filled with contiguous dates and compare them to the measured data dates.

Update

The problem can be recreated using the SQL in this section.

Table

The table is created as follows:

CREATE TABLE climate.calendar
(
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Generate Data

The following SQL inserts data into a table (id [int], name [varchar], date [date], valid [boolean]):

insert into climate.calendar (n, d)
    select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
    select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
    select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
    select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
    select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
    select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;

The values 'A' through 'F' represent the names of weather stations that made a measurement on a particular day.

Remove Random Rows

Delete some rows as follows:

delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);

Attempt #1

The following does not toggle the valid flag to false for all the days in a month where the month is missing one or more days:

UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
    SELECT DISTINCT date_trunc('month', d)
    FROM climate.calendar A
    WHERE NOT EXISTS (
        SELECT 1
        FROM climate.calendar B
        WHERE A.d - 1 = B.d
   )
);
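One likely reason the update above misses months is that the NOT EXISTS subquery is not correlated on the station: a date counts as "present" if any station has a row for the previous day. As a diagnostic sketch (not the final month flag), the gaps can be listed per station with lag(); it assumes at most one row per station per day, and the output column names are made up for illustration:

```sql
-- List every gap in each station's dates.
-- prev_d is the station's previous reading date; a difference > 1 means days are missing.
select n,
       prev_d         as gap_starts_after,
       d              as gap_ends_before,
       d - prev_d - 1 as days_missing
from (
    select n, d,
           lag(d) over (partition by n order by d) as prev_d
    from climate.calendar
) s
where d - prev_d > 1   -- the first row per station has prev_d = null and is skipped
order by n, d;
```

A gap found this way can span a month boundary, so it still has to be mapped to the months it touches before any rows are flagged.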

Attempt #2

The following SQL produces an empty result set:

with gen_calendar as (
    select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
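The empty result is probably explained by the join condition: it matches a generated date against rows from any station, and with six stations it is very unlikely that the same date was deleted from all of them. A minimal sketch of the same query restricted to a single station ('A' is picked arbitrarily here):

```sql
-- Dates in the generated range for which station 'A' has no row.
with gen_calendar as (
    select date '1982-01-01' + n as cal_date
    from generate_series(0, date '2011-04-09' - date '1982-01-01') as n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c
       on c.d = gc.cal_date
      and c.n = 'A'          -- match only this station's rows
where c.d is null
order by gc.cal_date;
```

Keeping the station test in the ON clause (rather than in WHERE) matters: moving it to WHERE would filter out the unmatched rows that the left join is meant to expose.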

Attempt #3

The following SQL generates all the possible combinations of station names and dates:

select
  distinct( cc.n ), t.d
from
  climate.calendar cc,
  (
    select (date('1982-01-1') + (n || ' days')::interval)::date d
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
  ) t
order by
  cc.n

However, the real data has several hundred stations and dates going back to the mid-1800s, so the Cartesian product of all dates and all stations is too large. Such an approach might work, given enough time... There must be a faster way.

Attempt #4

PostgreSQL has windowing functions.

How to select specific changes using windowing functions in postgres
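As a rough sketch of how a window function could apply here, the rows of each station-month can be counted in place and compared with the length of that month, without generating a calendar. This assumes at most one row per station per day; days_present and days_in_month are names invented for this example:

```sql
-- List every row that belongs to an incomplete station-month.
select id, n, d
from (
    select
        id, n, d,
        -- rows this station actually has in the month containing d
        count(*) over (partition by n, date_trunc('month', d::timestamp)) as days_present,
        -- day-of-month of the month's last day, i.e. the length of the month
        extract(day from date_trunc('month', d::timestamp)
                         + interval '1 month' - interval '1 day') as days_in_month
    from climate.calendar
) s
where days_present < days_in_month
order by n, d;
```

The window function has to sit in a subquery because PostgreSQL does not allow window functions directly in a WHERE clause.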

Thank you!

Comments (3)

肤浅与狂妄 2024-11-12 08:43:52

generate_series()

PostgreSQL's generate_series() function can create a view that contains a consecutive list of dates:

with calendar as (
    select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;

The expression select max(date) - min(date) from test might be off by one.
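One way to check the off-by-one concern is to look at the first and last generated dates. The sketch below uses the fixed date range from the question instead of the test table, which is an assumption made only to keep it self-contained:

```sql
-- generate_series(start, stop) includes both endpoints, so offsets 0 .. (max - min)
-- should cover the first and the last date of the range.
select min(date '1982-01-01' + n) as first_day,
       max(date '1982-01-01' + n) as last_day,
       count(*)                   as num_days
from generate_series(0, date '2011-04-09' - date '1982-01-01') as n;
```

If first_day and last_day come back equal to the two endpoints, the series is not off by one.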

Count Days Per Month

One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is translated to climate_calendar.) The second returns the actual daily readings each station produced per month.

Maximum Days Per Month Per Station

This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)

create view count_max_station_calendar_days as
with calendar as (
    select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar  -- stations is assumed to hold one row per station name n
group by n, yr, mo
order by n, yr, mo;

Actual Days Per Month Per Station

This view returns the number of daily readings each station actually recorded per month, which is never more than the number of days in that month. (For example, January will always have 31 days or fewer.)

create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;

Drop the ORDER BY clauses in production (they're helpful in development).

Compare Views

Join the two views to identify the stations and months that need to be flagged, into a new view:

create view invalid_station_months as 
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
       on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days)

n   yr    mo  num_days_missing
--
A   1982  1   1
E   2007  3   1

The column num_days_missing is not necessary, but it is useful.

These are the rows that need to be updated:

select cc.* 
from climate_calendar cc
inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
where valid = true

Update Database

To update them, the id key is convenient.

update climate_calendar
set valid = false
where id in (
    select id
    from climate_calendar cc
    inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
    where valid = true
);
挥剑断情 2024-11-12 08:43:52

Here is one way you could do it, assuming you have a BOOLEAN field called is_contiguous. Modify as necessary:

UPDATE measurement
SET is_contiguous = FALSE
WHERE NOT EXISTS (
  SELECT 1
    FROM measurement B
   WHERE measurement.taken - 1 = B.taken
);

Edit:

I believe I misunderstood your requirement. I thought you wanted to flag individual dates that were not contiguous. But apparently, you want to flag an entire month's worth of dates as discontiguous if it is missing any number of days.

Edit 2:

Here is a modified version of my original (incorrect) query which selects the distinct months that are missing any days:

UPDATE measurement
SET is_contiguous = FALSE
WHERE date_trunc('month', taken) IN (
    SELECT DISTINCT date_trunc('month', taken)
    FROM measurement A
    WHERE NOT EXISTS (
        SELECT 1
        FROM measurement B
        WHERE A.taken - 1 = B.taken
   )
);
最美的太阳 2024-11-12 08:43:52

Assuming that there can't be more than one row per day, this should return all the months for which the number of rows is not equal to the number of days in the month.

SELECT n, DATE_TRUNC('month', d)  -- n is the station column in climate.calendar
FROM climate.calendar
GROUP BY n, DATE_TRUNC('month', d)
HAVING COUNT(*) <>
  DATE_PART('day',  -- day-of-month of the month's last day = number of days in the month
            DATE_TRUNC('month', d) + INTERVAL '1 month' - INTERVAL '1 day')
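To turn a grouped check like this into the flag the question asks for, the incomplete station-months can be joined back in an UPDATE ... FROM. This is a sketch under a few assumptions: the station column is n (as in the question's climate.calendar table), there is at most one row per station per day, and the length of a month is taken as the day-of-month of its last day. The alias incomplete and the column month_start are made up for this example:

```sql
-- Flag every row that belongs to a station-month with fewer rows than days.
update climate.calendar c
set valid = false
from (
    select n, date_trunc('month', d::timestamp) as month_start
    from climate.calendar
    group by n, date_trunc('month', d::timestamp)
    having count(*) <> extract(day from date_trunc('month', d::timestamp)
                                        + interval '1 month' - interval '1 day')
) incomplete
where c.n = incomplete.n
  and date_trunc('month', c.d::timestamp) = incomplete.month_start;
```

Note that a month the data only partly covers, such as April 2011 in the sample range that ends on 2011-04-09, is flagged as incomplete as well.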