Calendar tables in PostgreSQL 9
I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills).
I have come across some references to building similar warehouses which implement the concept of 'calendar tables'. This makes sense and is easy enough to do. Most examples I see, however, are calendar tables that limit their scope to 'day'. My data will need to be analyzed down to the hour level, possibly to the minute.
My question: would an implementation of calendar tables for hour/minute-level granularity be of value in terms of space-efficiency and query/sorting speed? If so, can you recommend a table structure and population method/example?
My primary data table will contain 20+ million rows of data at any given time and typical subsets for analysis are in the 1 to 5 million range. So, as you can see, that is a lot of timestamp fields.
3 Answers
In PostgreSQL, you can generate calendar tables of arbitrary length and granularity on the fly.
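A minimal sketch with generate_series (the dates and step below are placeholders; adjust them to the range and granularity you need):

```sql
-- One row per minute for a single day, generated on the fly.
SELECT ts
FROM generate_series(
         timestamp '2011-01-01 00:00',
         timestamp '2011-01-01 23:59',
         interval '1 minute'
     ) AS ts;
```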
This does not require recursion (as it would in some other systems) and is a preferred method for generating volatile result sets.
Calendar tables implement a space/time tradeoff. By using more space, some kinds of queries run in less time, because they can take advantage of indexes. They're safe as long as you're careful with the CHECK() constraints, and as long as you have administrative processes to take care of any constraints that your dbms doesn't support.
If your granularity is one minute, you'll need to generate about a half million rows for each year. A minimal calendar table would look like this.
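A minimal sketch, assuming one row per minute keyed by the timestamp itself (the table and column names are illustrative):

```sql
-- Minimal calendar table: one row per minute, keyed by the minute itself.
CREATE TABLE calendar (
    cal_time timestamp PRIMARY KEY
);
```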
If you're doing "bucket" analysis, you might be better off with something like this.
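A sketch of a bucket-oriented variant (column names illustrative), where each row carries the start and end of its bucket:

```sql
-- One row per bucket; [bucket_start, bucket_end) is the bucket's range.
CREATE TABLE calendar (
    bucket_start timestamp PRIMARY KEY,
    bucket_end   timestamp NOT NULL,
    CHECK (bucket_start < bucket_end)
);
```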
Since SQL's BETWEEN operator includes both endpoints, you usually need to avoid it here: it's hard to express bucket_end as "bucket_start plus one minute, minus the smallest unit of time this server can recognize". (The danger is a value that's a microsecond greater than bucket_end, but still less than the next value of bucket_start.)
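The usual workaround is to make bucket_end equal to the next bucket's start and compare with a half-open range instead of BETWEEN. A sketch, assuming a fact table facts with a timestamp column ts (both names hypothetical):

```sql
-- Half-open comparison: includes bucket_start, excludes bucket_end,
-- so no value can fall between two adjacent buckets.
SELECT c.bucket_start, count(*)
FROM   facts f
JOIN   calendar c
  ON   f.ts >= c.bucket_start
 AND   f.ts <  c.bucket_end
GROUP BY c.bucket_start;
```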
If I were going to build that table, I'd probably do it like this. (Although I'd think harder about whether I should call it "calendar".)
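A sketch of that combined table; the exact constraints below are my guess at the intent (a minute-level key plus half-open bucket boundaries, with a UNIQUE constraint on the bucket columns):

```sql
CREATE TABLE calendar (
    cal_time     timestamp PRIMARY KEY,   -- the minute itself
    bucket_start timestamp NOT NULL,
    bucket_end   timestamp NOT NULL,
    UNIQUE (bucket_start, bucket_end),
    CHECK (bucket_start <= cal_time),
    CHECK (cal_time < bucket_end),
    CHECK (bucket_end = bucket_start + interval '1 minute')
);
```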
The UNIQUE constraint creates an implicit index in PostgreSQL.
This query will insert one day's worth of rows (24 hours * 60 minutes) at a time.
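A sketch of one way to do it, using generate_series against the table above (the date is a placeholder):

```sql
-- Insert all 1440 minutes of 2011-01-01.
INSERT INTO calendar (cal_time, bucket_start, bucket_end)
SELECT ts,
       date_trunc('minute', ts),
       date_trunc('minute', ts) + interval '1 minute'
FROM generate_series(
         timestamp '2011-01-01 00:00',
         timestamp '2011-01-01 23:59',
         interval '1 minute'
     ) AS ts;
```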
You can wrap that in a function to generate a year at a time. I'd probably try to commit fewer than a half million rows at a time.
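One possible shape for that wrapper, written per day rather than per year to keep the sketch short (the function name is hypothetical); calling it in batches keeps each transaction well under half a million rows:

```sql
CREATE OR REPLACE FUNCTION insert_calendar_day(p_day date)
RETURNS void AS $$
BEGIN
    INSERT INTO calendar (cal_time, bucket_start, bucket_end)
    SELECT ts,
           date_trunc('minute', ts),
           date_trunc('minute', ts) + interval '1 minute'
    FROM generate_series(p_day::timestamp,
                         p_day::timestamp + interval '23 hours 59 minutes',
                         interval '1 minute') AS ts;
END;
$$ LANGUAGE plpgsql;

-- Call it once per day, e.g. from a driver script that loops over a year
-- and commits in chunks:
SELECT insert_calendar_day(date '2011-01-01');
```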
It shouldn't take too long to generate 20 million rows for testing, and another 20 million rows of "calendar" minutes. Long lunch. Maybe an afternoon in the sun.
In the data warehouses I built, I used separate CALENDAR and TIME_OF_DAY dimensions. The first dimension has a one-day granularity and the second a one-minute granularity.
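A sketch of what that split might look like (table and column names are illustrative, not the actual schema):

```sql
-- Day-grain dimension: one row per calendar day.
CREATE TABLE calendar_dim (
    date_key  integer PRIMARY KEY,   -- e.g. 20110107
    cal_date  date NOT NULL UNIQUE
);

-- Minute-grain dimension: 1440 rows, one per minute of the day.
CREATE TABLE time_of_day_dim (
    time_key        integer PRIMARY KEY,  -- 0 .. 1439
    hour_of_day     smallint NOT NULL,
    minute_of_hour  smallint NOT NULL
);

-- A fact table then carries both keys:
-- fact (..., date_key REFERENCES calendar_dim,
--            time_key REFERENCES time_of_day_dim)
```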
In two other cases I knew beforehand that no reporting would be required at a granularity finer than 15 minutes. In those cases, for simplicity, I used a single CALENDAR dimension with 96 records per day.
So far I have used this approach in Oracle warehouses, but I may be involved in a PostgreSQL warehouse project this summer.