Calendar Tables in PostgreSQL 9



I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).

I have come across some references to building similar warehouses which implement the concept of 'calendar tables'. This makes sense and is easily enough done. Most examples I see, however, are calendar tables that limit scope to 'day'. My data will need to be analyzed down to hour-level. Possibly minutes.

My question: would an implementation of calendar tables for hour/minute-level granularity be of value in terms of space-efficiency and query/sorting speed? If so, can you recommend a table structure and population method/example?

My primary data table will contain 20+ million rows of data at any given time and typical subsets for analysis are in the 1 to 5 million range. So, as you can see, that is a lot of timestamp fields.


无人接听 2024-11-10 21:43:14


In PostgreSQL, you can generate calendar tables of arbitrary length and granularity on the fly:

SELECT  CAST('2011-01-01' AS DATE) + (n || ' hour')::INTERVAL
FROM    generate_series(0, 23) n

This does not require recursion (unlike some other systems) and is the preferred way to generate volatile result sets.
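The same function also has a timestamp overload (available since PostgreSQL 8.4) that expresses the idea even more directly. As a sketch, this emits one row per hour for a single day; the range shown is just an example:

SELECT ts
FROM   generate_series(
           TIMESTAMP '2011-01-01 00:00:00',
           TIMESTAMP '2011-01-01 23:00:00',
           INTERVAL '1 hour'
       ) AS g(ts);

Widening the bounds or shrinking the step to INTERVAL '1 minute' gives whatever granularity the analysis needs.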

走走停停 2024-11-10 21:43:14


Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.

If your granularity is one minute, you'll need to generate about a half million rows for each year (60 minutes × 24 hours × 365 days = 525,600). A minimal calendar table would look like this.

2011-01-01 00:00:00
2011-01-01 00:01:00
2011-01-01 00:02:00
2011-01-01 00:03:00
2011-01-01 00:04:00

If you're doing "bucket" analysis, you might be better off with something like this.

bucket_start         bucket_end
--
2011-01-01 00:00:00  2011-01-01 00:01:00
2011-01-01 00:01:00  2011-01-01 00:02:00
2011-01-01 00:02:00  2011-01-01 00:03:00
2011-01-01 00:03:00  2011-01-01 00:04:00
2011-01-01 00:04:00  2011-01-01 00:05:00

Since SQL's BETWEEN operator includes both endpoints, you usually need to avoid it: it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest unit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value for bucket_start.)
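A half-open predicate (>= the start, < the end) sidesteps that problem entirely. As an illustration, assuming a fact table named events with a timestamp column event_ts (both names are hypothetical), a per-minute rollup would look like this:

-- Per-minute counts using a half-open range instead of BETWEEN;
-- "events" and "event_ts" are hypothetical names for the fact table.
select c.bucket_start, count(e.event_ts) as events_in_bucket
from calendar c
left join events e
       on e.event_ts >= c.bucket_start
      and e.event_ts <  c.bucket_end
group by c.bucket_start
order by c.bucket_start;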

If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)

create table calendar (
  bucket_start timestamp primary key,
  bucket_end timestamp unique,
  CHECK (bucket_end = bucket_start + interval '1' minute)
  -- You also want a "no gaps" constraint, but I don't think you 
  -- can do that in a CHECK constraint in PostgreSQL. You might
  -- be able to use a trigger that counts the rows, and compares
  -- that count to the number of minutes between min(bucket_start)
  -- and max(bucket_start). Worst case, you can always run a report
  -- that counts the rows and sends you an email.
);

The UNIQUE constraint creates an implicit index in PostgreSQL.
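As a sketch of the fallback report mentioned in the comments above, the gap check can be a plain query: the table is gap-free exactly when the row count equals the number of minutes spanned plus one.

-- Gap report: rows_present should equal rows_expected.
select count(*) as rows_present,
       extract(epoch from max(bucket_start) - min(bucket_start)) / 60 + 1
           as rows_expected
from calendar;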

This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.

-- Note: because n starts at 1, the very first run inserts its first
-- bucket_start one minute after the seed value; seed one minute
-- earlier if the series must begin exactly at midnight.
insert into calendar
select coalesce(
                (select max(bucket_start) from calendar),
                 cast('2011-01-01 00:00:00' as timestamp)
               )
             + cast((n || ' minute') as interval) as bucket_start,
       coalesce(
                (select max(bucket_start) from calendar),
                 cast('2011-01-01 00:00:00' as timestamp)
               )
             + cast((n + 1 || ' minute') as interval) as bucket_end
from generate_series(1, (24*60) ) n;

You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.
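As an alternative sketch, the timestamp form of generate_series can emit a whole year of buckets in one set-based insert (the year shown is illustrative). Bear in mind that a single statement runs in a single transaction, so split the range into smaller pieces if you want smaller commits:

-- One year of one-minute buckets in one statement; adjust the range
-- to taste. bucket_end satisfies the CHECK constraint above.
insert into calendar (bucket_start, bucket_end)
select ts, ts + interval '1 minute'
from generate_series(timestamp '2011-01-01 00:00:00',
                     timestamp '2011-12-31 23:59:00',
                     interval '1 minute') as g(ts);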

It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.

逐鹿 2024-11-10 21:43:14


In the data warehouses I built, I used separate CALENDAR and TIME_OF_DAY dimensions. The first has a one-day granularity and the second a one-minute granularity.

In two other cases I knew beforehand that no reporting would be required at a granularity finer than 15 minutes. In those cases, for simplicity, I used a single CALENDAR dimension with 96 records per day.

I have used this approach in Oracle warehouses so far, but I may be involved in a PostgreSQL warehouse project this summer.
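A minimal sketch of that layout in PostgreSQL, with illustrative names (a fact table would then carry both a date_key and a minute_of_day foreign key):

-- One row per calendar day; add year/quarter/month attributes as needed.
create table calendar_dim (
  date_key date primary key
);

-- One row per minute of the day: 0 .. 1439.
create table time_of_day_dim (
  minute_of_day  smallint primary key,
  hour_of_day    smallint not null,
  minute_in_hour smallint not null
);

insert into calendar_dim
select date '2011-01-01' + n
from generate_series(0, 364) n;

insert into time_of_day_dim
select n, n / 60, n % 60
from generate_series(0, 1439) n;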
