Calendar tables in PostgreSQL 9
I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).
I have come across some references to building similar warehouses which implement the concept of 'calendar tables'. This makes sense and is easy enough to do. Most examples I see, however, are calendar tables that limit their scope to 'day'. My data will need to be analyzed down to the hour level, possibly to the minute.
My question: would an implementation of calendar tables for hour/minute-level granularity be of value in terms of space-efficiency and query/sorting speed? If so, can you recommend a table structure and population method/example?
My primary data table will contain 20+ million rows of data at any given time and typical subsets for analysis are in the 1 to 5 million range. So, as you can see, that is a lot of timestamp fields.
3 Answers
In PostgreSQL, you can generate calendar tables of arbitrary length and granularity on the fly.
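A minimal sketch with generate_series (the dates and step below are placeholders; adjust them to the range and granularity you need):

```sql
-- One row per minute for a single day, generated on the fly.
SELECT ts
FROM generate_series(
         timestamp '2011-01-01 00:00',
         timestamp '2011-01-01 23:59',
         interval '1 minute'
     ) AS ts;
```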
This does not require recursion (as it would in some other systems) and is a preferred method for generating volatile result sets.
Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.
If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.
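A minimal sketch, assuming one row per minute keyed by the timestamp itself (the table and column names are illustrative):

```sql
-- Minimal calendar table: one row per minute, keyed by the minute itself.
CREATE TABLE calendar (
    cal_time timestamp PRIMARY KEY
);
```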
If you're doing "bucket" analysis, you might be better off with something like this.
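A sketch of a bucket-oriented variant (column names illustrative), where each row carries the start and end of its bucket:

```sql
-- One row per bucket; [bucket_start, bucket_end) is the bucket's range.
CREATE TABLE calendar (
    bucket_start timestamp PRIMARY KEY,
    bucket_end   timestamp NOT NULL,
    CHECK (bucket_start < bucket_end)
);
```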
Since SQL's BETWEEN operator includes both endpoints, you usually need to avoid it here: it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest unit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value of bucket_start.)
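The usual workaround is to make bucket_end equal to the next bucket's start and compare with a half-open range instead of BETWEEN. A sketch, assuming a fact table facts with a timestamp column ts (both names hypothetical):

```sql
-- Half-open comparison: includes bucket_start, excludes bucket_end,
-- so no value can fall between two adjacent buckets.
SELECT c.bucket_start, count(*)
FROM   facts f
JOIN   calendar c
  ON   f.ts >= c.bucket_start
 AND   f.ts <  c.bucket_end
GROUP BY c.bucket_start;
```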
If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)
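A sketch of that combined table; the exact constraints below are my guess at the intent (a minute-level key plus half-open bucket boundaries, with a UNIQUE constraint on the bucket columns):

```sql
CREATE TABLE calendar (
    cal_time     timestamp PRIMARY KEY,   -- the minute itself
    bucket_start timestamp NOT NULL,
    bucket_end   timestamp NOT NULL,
    UNIQUE (bucket_start, bucket_end),
    CHECK (bucket_start <= cal_time),
    CHECK (cal_time < bucket_end),
    CHECK (bucket_end = bucket_start + interval '1 minute')
);
```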
The UNIQUE constraint creates an implicit index in PostgreSQL.
This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.
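A sketch of one way to do it, using generate_series against the table above (the date is a placeholder):

```sql
-- Insert all 1440 minutes of 2011-01-01.
INSERT INTO calendar (cal_time, bucket_start, bucket_end)
SELECT ts,
       date_trunc('minute', ts),
       date_trunc('minute', ts) + interval '1 minute'
FROM generate_series(
         timestamp '2011-01-01 00:00',
         timestamp '2011-01-01 23:59',
         interval '1 minute'
     ) AS ts;
```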
You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.
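One possible shape for that wrapper, written per day rather than per year to keep the sketch short (the function name is hypothetical); calling it in batches keeps each transaction well under half a million rows:

```sql
CREATE OR REPLACE FUNCTION insert_calendar_day(p_day date)
RETURNS void AS $$
BEGIN
    INSERT INTO calendar (cal_time, bucket_start, bucket_end)
    SELECT ts,
           date_trunc('minute', ts),
           date_trunc('minute', ts) + interval '1 minute'
    FROM generate_series(p_day::timestamp,
                         p_day::timestamp + interval '23 hours 59 minutes',
                         interval '1 minute') AS ts;
END;
$$ LANGUAGE plpgsql;

-- Call it once per day, e.g. from a driver script that loops over a year
-- and commits in chunks:
SELECT insert_calendar_day(date '2011-01-01');
```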
It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.
In the data warehouses I built, I used separate CALENDAR and TIME_OF_DAY dimensions. The first dimension has a one-day granularity and the second a one-minute granularity.
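A sketch of what that split might look like (table and column names are illustrative, not the actual schema):

```sql
-- Day-grain dimension: one row per calendar day.
CREATE TABLE calendar_dim (
    date_key  integer PRIMARY KEY,   -- e.g. 20110107
    cal_date  date NOT NULL UNIQUE
);

-- Minute-grain dimension: 1440 rows, one per minute of the day.
CREATE TABLE time_of_day_dim (
    time_key        integer PRIMARY KEY,  -- 0 .. 1439
    hour_of_day     smallint NOT NULL,
    minute_of_hour  smallint NOT NULL
);

-- A fact table then carries both keys:
-- fact (..., date_key REFERENCES calendar_dim,
--            time_key REFERENCES time_of_day_dim)
```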
In two other cases I knew beforehand that no reporting would be required at a granularity finer than 15 minutes. In those cases, for simplicity, I used a single CALENDAR dimension with 96 records per day.
So far I have used this approach in Oracle warehouses, but I may be involved in a PostgreSQL warehouse project this summer.