Oracle：查找预测排名列表的先前记录

发布于 2024-08-06 21:18:57 字数 3118 浏览 9 评论 0原文

嗨，我面临着一个难题：

我有一个天气预报表（oracle 9i）（大小有数百或数百万条记录。）其构成如下：

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk)   integer (pk)        datetime (pk)      integer

其中：

stationid 指的是可以创建预报的众多气象站之一；
forecastdate 指的是预测的日期（仅日期而不是时间）。
forecastinterval 指的是预测的 forecastdate 中的小时（0 - 23）。
forecastcreated 指的是进行预测的时间，可以提前很多天。
forecastvalue 指的是预测的实际值（顾名思义）。

我需要确定给定的 stationid 和给定的 forecastdate 以及forecastinterval 对，即 forecastvalue 增量超过名义数字（例如 500）的记录。我将在这里显示一个条件表：

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
'stationa'   13-dec-09       10                  10-dec-09 04:50:10  0
'stationa'   13-dec-09       10                  10-dec-09 17:06:13  0
'stationa'   13-dec-09       10                  12-dec-09 05:20:50  300
'stationa'   13-dec-09       10                  13-dec-09 09:20:50  300

在上面的场景中，我想取出第三条记录。这是预测值名义上（例如 100）增加的记录。

事实证明，由于表的大小（数百条或数百万条记录）并且需要很长时间才能完成（事实上，太长的时间以至于我的查询从未返回），该任务非常困难。

这是我迄今为止的尝试获取这些值：

select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*
            rank() over (partition by stationid, 
                                   (inner.forecastdate + inner.forecastinterval),
                                   inner.forecastcreated
                         order by stationid, 
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
            ) rk
      from weathertable inner) wtr 
      where
      wtr.forecastvalue - 100 > (
                     select lastvalue
                      from (select y.*,
                            rank() over (partition by stationid, 
                                            (forecastdate + forecastinterval),
                                            forecastcreated
                                         order by stationid, 
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
                             from weathertable y
                            ) z
                       where z.stationid = wtr.stationid
                             and z.forecastdate = wtr.forecastdate                                                   
                             and (z.forecastinterval =    
                                         wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
                             and wtr.rk = z.rk + 1)

原文

Hi I am faced with a difficult problem:

I have a table (oracle 9i) of weather forecasts (many 100's of millions of records in size.)
whose makeup looks like this:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk)   integer (pk)        datetime (pk)      integer

where:

stationid refers to one of the many weather stations that may create a forecast;
forecastdate refers to the date the forecast is for (date only not time.)
forecastinterval refers to the hour in the forecastdate for the forecast (0 - 23).
forecastcreated refers to the time the forecast was made, can be many days beforehand.
forecastvalue refers to the actual value of the forecast (as the name implies.)

I need to determine for a given stationid and a given forecastdate and forecastinterval pair, the records where a forecastvalue increments more than a nominal number (say 500). I'll show a table of the condition here:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
'stationa'   13-dec-09       10                  10-dec-09 04:50:10  0
'stationa'   13-dec-09       10                  10-dec-09 17:06:13  0
'stationa'   13-dec-09       10                  12-dec-09 05:20:50  300
'stationa'   13-dec-09       10                  13-dec-09 09:20:50  300

In the above scenario, I'd like to pull out the third record. This is the record where the forecast value increased by a nominal (say 100) amount.

The task is proving to be very difficult due to the sheer size of the table (many 100s of millions of records.) and taking so long to finish (so long in fact that my query has never returned.)

Here is my attempt so far to grab these values:

select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*
            rank() over (partition by stationid, 
                                   (inner.forecastdate + inner.forecastinterval),
                                   inner.forecastcreated
                         order by stationid, 
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
            ) rk
      from weathertable inner) wtr 
      where
      wtr.forecastvalue - 100 > (
                     select lastvalue
                      from (select y.*,
                            rank() over (partition by stationid, 
                                            (forecastdate + forecastinterval),
                                            forecastcreated
                                         order by stationid, 
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
                             from weathertable y
                            ) z
                       where z.stationid = wtr.stationid
                             and z.forecastdate = wtr.forecastdate                                                   
                             and (z.forecastinterval =    
                                         wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
                             and wtr.rk = z.rk + 1)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

不醒的梦 2024-08-13 21:18:57

Rexem 建议使用 LAG() 是正确的方法，但我们需要使用分区子句。一旦我们为不同的时间间隔和不同的站点添加行，这一点就会变得清晰......

SQL> select * from t
  2  /    
STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /    
STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

为了消除误报，我们按 STATIONID、FORECASTDATE 和 FORECASTINTERVAL 对 LAG() 进行分组。请注意，以下内容依赖于内部查询从每个分区窗口的第一次计算中返回 NULL。

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>

处理大量数据

您将表描述为包含数亿行。如此巨大的桌子就像黑洞一样，它们具有不同的物理特性。有多种潜在的方法，具体取决于您的需求、时间范围、财务、数据库版本和版本以及系统数据的任何其他用途。答案超过五分钟。

但无论如何，这是五分钟的答案。

假设您的表是活动表，它可能是通过在发生时添加预测来填充的，这基本上是一个附加操作。这意味着任何给定气象站的预报都分散在整个表中。因此，仅 STATIONID 甚至 FORECASTDATE 上的索引的聚类因子会很差。

基于这个假设，我建议您首先尝试的一件事是在 (STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE) 上构建索引。这将需要一些时间（和磁盘空间）来构建，但它应该会大大加快后续查询的速度，因为它具有满足 INDEX RANGE SCAN 查询所需的所有列，而根本不需要接触表。

Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...

SQL> select * from t
  2  /    
STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /    
STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>

Working with large volumes of data

You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes, they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five minute answer.

But here's the five minute answer anyway.

Assuming your table is the live table it is presumably being populating by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.

On that assumption, the one thing I would suggest you try first is building an index on (STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries quite considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.

回复收藏 0 原文

~没有更多了~