Oracle:查找预测排名列表的先前记录
嗨,我面临着一个难题:
我有一个天气预报表(oracle 9i)(大小有数百或数百万条记录。) 其构成如下:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk) integer (pk) datetime (pk) integer
其中:
stationid
指的是可以创建预报的众多气象站之一;forecastdate
指的是预测的日期(仅日期而不是时间)。forecastinterval
指的是预测的forecastdate
中的小时(0 - 23)。forecastcreated
指的是进行预测的时间,可以提前很多天。forecastvalue
指的是预测的实际值(顾名思义)。
我需要确定给定的 stationid
和给定的 forecastdate
以及forecastinterval
对,即 forecastvalue
增量超过名义数字(例如 500)的记录。我将在这里显示一个条件表:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
'stationa' 13-dec-09 10 10-dec-09 04:50:10 0
'stationa' 13-dec-09 10 10-dec-09 17:06:13 0
'stationa' 13-dec-09 10 12-dec-09 05:20:50 300
'stationa' 13-dec-09 10 13-dec-09 09:20:50 300
在上面的场景中,我想取出第三条记录。这是预测值名义上(例如 100)增加的记录。
事实证明,由于表的大小(数百条或数百万条记录)并且需要很长时间才能完成(事实上,太长的时间以至于我的查询从未返回),该任务非常困难。
这是我迄今为止的尝试获取这些值:
select
wtr.stationid,
wtr.forecastcreated,
wtr.forecastvalue,
(wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
(select inner.*
rank() over (partition by stationid,
(inner.forecastdate + inner.forecastinterval),
inner.forecastcreated
order by stationid,
(inner.forecastdate + inner.forecastinterval) asc,
inner.forecastcreated asc
) rk
from weathertable inner) wtr
where
wtr.forecastvalue - 100 > (
select lastvalue
from (select y.*,
rank() over (partition by stationid,
(forecastdate + forecastinterval),
forecastcreated
order by stationid,
(forecastdate + forecastinterval) asc,
forecastcreated asc) rk
from weathertable y
) z
where z.stationid = wtr.stationid
and z.forecastdate = wtr.forecastdate
and (z.forecastinterval =
wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
and wtr.rk = z.rk + 1)
Hi I am faced with a difficult problem:
I have a table (oracle 9i) of weather forecasts (many 100's of millions of records in size.)
whose makeup looks like this:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk) integer (pk) datetime (pk) integer
where:
stationid
refers to one of the many weather stations that may create a forecast;forecastdate
refers to the date the forecast is for (date only not time.)forecastinterval
refers to the hour in theforecastdate
for the forecast (0 - 23).forecastcreated
refers to the time the forecast was made, can be many days beforehand.forecastvalue
refers to the actual value of the forecast (as the name implies.)
I need to determine for a given stationid
and a given forecastdate
and forecastinterval
pair, the records where a forecastvalue
increments more than a nominal number (say 500). I'll show a table of the condition here:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
'stationa' 13-dec-09 10 10-dec-09 04:50:10 0
'stationa' 13-dec-09 10 10-dec-09 17:06:13 0
'stationa' 13-dec-09 10 12-dec-09 05:20:50 300
'stationa' 13-dec-09 10 13-dec-09 09:20:50 300
In the above scenario, I'd like to pull out the third record. This is the record where the forecast value increased by a nominal (say 100) amount.
The task is proving to be very difficult due to the sheer size of the table (many 100s of millions of records.) and taking so long to finish (so long in fact that my query has never returned.)
Here is my attempt so far to grab these values:
select
wtr.stationid,
wtr.forecastcreated,
wtr.forecastvalue,
(wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
(select inner.*
rank() over (partition by stationid,
(inner.forecastdate + inner.forecastinterval),
inner.forecastcreated
order by stationid,
(inner.forecastdate + inner.forecastinterval) asc,
inner.forecastcreated asc
) rk
from weathertable inner) wtr
where
wtr.forecastvalue - 100 > (
select lastvalue
from (select y.*,
rank() over (partition by stationid,
(forecastdate + forecastinterval),
forecastcreated
order by stationid,
(forecastdate + forecastinterval) asc,
forecastcreated asc) rk
from weathertable y
) z
where z.stationid = wtr.stationid
and z.forecastdate = wtr.forecastdate
and (z.forecastinterval =
wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
and wtr.rk = z.rk + 1)
Rexem 建议使用 LAG() 是正确的方法,但我们需要使用分区子句。一旦我们为不同的时间间隔和不同的站点添加行,这一点就会变得清晰......
为了消除误报,我们按 STATIONID、FORECASTDATE 和 FORECASTINTERVAL 对 LAG() 进行分组。请注意,以下内容依赖于内部查询从每个分区窗口的第一次计算中返回 NULL。
处理大量数据
您将表描述为包含数亿行。如此巨大的桌子就像黑洞一样,它们具有不同的物理特性。有多种潜在的方法,具体取决于您的需求、时间范围、财务、数据库版本和版本以及系统数据的任何其他用途。答案超过五分钟。
但无论如何,这是五分钟的答案。
假设您的表是活动表,它可能是通过在发生时添加预测来填充的,这基本上是一个附加操作。这意味着任何给定气象站的预报都分散在整个表中。因此,仅 STATIONID 甚至 FORECASTDATE 上的索引的聚类因子会很差。
基于这个假设,我建议您首先尝试的一件事是在
(STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE)
上构建索引。这将需要一些时间(和磁盘空间)来构建,但它应该会大大加快后续查询的速度,因为它具有满足 INDEX RANGE SCAN 查询所需的所有列,而根本不需要接触表。Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...
To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.
Working with large volumes of data
You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes, they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five minute answer.
But here's the five minute answer anyway.
Assuming your table is the live table it is presumably being populating by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.
On that assumption, the one thing I would suggest you try first is building an index on
(STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE)
. This will take some time (and disk space) to build, but it ought to speed up your subsequent queries quite considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.