Fastest way to see which unique dates exist in a table's timestamp field?
I have a table with billions of rows. There are daily partitions on the "recorded" field, which is a "timestamp without time zone." I want to know which days are currently in the table. I know I could do something like:
SELECT recorded::date
FROM table
GROUP BY 1;
Which ideally should work, but the explain on that is rather high, and indicates it would take quite a while to work... if that's the best I can do, I can accept that (and we could keep tabs on the data as it goes in), but I was wondering if there might be a more efficient way to do this, given that I have daily partitioning?
2 Answers
You can create an index something like this:
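The answer's original index definition was omitted from this copy of the thread. A sketch consistent with the description (an expression index on `date_trunc()`, which is immutable for `timestamp without time zone`) might look like the following; `big_table` and `recorded` stand in for your actual table and column names:

```sql
-- Hypothetical expression index on the day of each timestamp.
-- (date_trunc('day', ...) is IMMUTABLE for timestamp without time zone,
-- so PostgreSQL allows it in an index.)
CREATE INDEX big_table_recorded_day_idx
    ON big_table (date_trunc('day', recorded));

-- Queries must use the same expression to make the index usable:
SELECT *
FROM big_table
WHERE date_trunc('day', recorded) = '2012-03-01';
```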
In my test, PostgreSQL 9.something used a sequential scan before adding the index, a sequential scan after simply indexing the column "recorded", and an index scan after indexing it with date_trunc(). Selecting a single day's rows took 66ms without an index, 68ms with a plain index, and 13ms with an index using date_trunc().
With billions of rows, expect creating that index to take a few minutes. (cough)
There's a very similar thread here:
Slow select distinct query on postgres
If you know the min/max dates, you'll be better off querying against a list of dates than doing a seq scan over the whole table. Assuming you've an index on recorded, something that looks like this should be faster:
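The answer's query was also dropped in this copy. A sketch of the idea (probing each candidate day with an indexed range check, rather than scanning the whole table) might look like this; the date bounds and the names `big_table`/`recorded` are placeholders:

```sql
-- Generate one candidate row per day between the known min/max dates,
-- then keep only the days for which at least one row exists.
-- Each EXISTS probe is a cheap index range scan on "recorded".
SELECT d.day::date
FROM generate_series('2010-01-01'::timestamp,
                     '2012-12-31'::timestamp,
                     interval '1 day') AS d(day)
WHERE EXISTS (
    SELECT 1
    FROM big_table
    WHERE recorded >= d.day
      AND recorded <  d.day + interval '1 day'
);
```

The half-open range (`>= day AND < day + 1 day`) is what lets a plain index on `recorded` be used directly, without wrapping the column in a function.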
There might be a few tweaks to make to the above query, but the general idea is there: it'll be faster to do a few thousand subquery/index scans on an indexed field than to seq scan a few billion rows and aggregate them in order to identify the distinct days.