Postgres 大表 Select Query With In Clause 的最佳索引选择
我们有一个非常大的表,总行数约为 ~40 亿
,每天的吞吐量约为 ~20-2500 万
。
以下是示例表定义。
table_name (
id bigint,
user_id bigint,
status smallint,
date timestamp,
.
.
some more columns
);
注意:状态是一个枚举,可以有 6-7 个可能的值。
下面是我们要优化的查询:
SELECT *
FROM table_name
WHERE user_id = $user_id
AND status in (1, 2, 3, 6, 7, 10)
ORDER BY date
DESC LIMIT 20;
最初,我们在 table_name (user_id, status)
上有 index1
。
由于该索引没有提供最佳性能。我们考虑将日期也包含在索引中。
现在,我们尝试在表上创建一堆不同的索引,但解释计划总是选择初始索引,即:index1。
以下是我们尝试过的索引:
index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));
以下是解释分析输出:
Limit (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
-> Sort (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
Sort Key: id DESC
Sort Method: top-N heapsort Memory: 38kB
-> Index Scan using index1 on table_name wt (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))
Planning Time: 0.320 ms
Execution Time: 32.498 ms
我们的数据库postgres版本是:< code>12.7 并且我们定期运行vaccuming
。
我们想要了解为什么其他索引没有用于我们的查询。
另外考虑到我们的用例,是否有更好的方法来创建索引,以便我们可以服务查询在可接受的响应时间内?
We have a very large table with a total number of rows around ~4 billion
and everyday throughput is around ~20-25 million
.
Below is the Sample Table definition.
table_name (
id bigint,
user_id bigint,
status smallint,
date timestamp,
.
.
some more columns
);
NOTE: status is an enum and can have 6-7 possible values.
Below is the query that we want to optimise:
SELECT *
FROM table_name
WHERE user_id = $user_id
AND status in (1, 2, 3, 6, 7, 10)
ORDER BY date
DESC LIMIT 20;
Initially, we were having index1
on table_name (user_id, status)
.
Since this index was not giving optimal performance. We thought of including the date in the index as well.
Now, we have tried creating a bunch of different indexes on the table, but the explain plan always picks up the initial index i.e: index1.
Below are the indexes that we tried:
index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));
Below is the explain analyse output:
Limit (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
-> Sort (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
Sort Key: id DESC
Sort Method: top-N heapsort Memory: 38kB
-> Index Scan using index1 on table_name wt (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))
Planning Time: 0.320 ms
Execution Time: 32.498 ms
Our database postgres version is: 12.7
and we run vaccuming
regularly.
We want to understand why other indexes are not being used for our query.
Also given our use case, Can there be a better way of creating index so that we can serve the query in acceptable response time?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
然后尝试以下操作:
Then try following:
您选择的是
*
,因此您将无法获得仅索引扫描,因为*
拖动的列多于索引中的列。您展示的其他索引(我可以看到)的唯一优点是启用仅索引扫描,因此如果这不起作用,那么不选择使用这些索引也就不足为奇了。您可以通过将*
更改为仅出现在索引中的列来测试这个理论,看看会发生什么。关于你的一个索引:
这个索引似乎毫无意义。 WHERE 子句的好处是它将“status”的不同限定值减少为单个值(“true”)。但是,将“状态”放入索引主体中只会再次将它们分解。更好的索引是:
这个索引可以跳转到特定 user_id 的末尾,向后扫描(以满足
order by date desc
),并在找到 20 行后停止。当你拥有闯入者的“地位”时,它就无法这样做。You are selecting
*
, so you will not be able to get an index-only scan due to the*
dragging in more columns than are in the index. The only advantage of those other indexes you showed (that I can see) would be to enable index-only scans, so if that cannot work, it is not surprising those indexes are not selected for use. You could test this theory by changing the*
to just the columns appearing in the index to see what happens.About one of your indexes:
This index seems rather pointless. The benefit of the WHERE clause is that it reduces the disparate qualifying values of "status" down to a single value ('true'). But then putting "status" into the index body just breaks them back up again. The better index would be:
This one can jump to the end of the specific user_id, scan backwards (to fulfill the
order by date desc
) and stop once it finds 20 rows. When you have "status" as an interloper, it prevents it from doing that.