Postgres 大表 Select Query With In Clause 的最佳索引选择

发布于 2025-01-13 01:44:47 字数 1869 浏览 3 评论 0原文

我们有一个非常大的表,总行数约为 ~40 亿,每天的吞吐量约为 ~20-2500 万

以下是示例表定义。

table_name (
   id bigint,
   user_id bigint,
   status  smallint,
   date    timestamp,
   .
   .
   some more columns
);

注意:状态是一个枚举,可以有 6-7 个可能的值。

下面是我们要优化的查询:

SELECT * 
FROM table_name 
WHERE user_id = $user_id 
AND status in (1, 2, 3, 6, 7, 10) 
ORDER BY date 
DESC LIMIT 20;

最初,我们在 table_name (user_id, status) 上有 index1

由于该索引没有提供最佳性能。我们考虑将日期也包含在索引中。

现在,我们尝试在表上创建一堆不同的索引,但解释计划总是选择初始索引,即:index1

以下是我们尝试过的索引

index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));

以下是解释分析输出:

Limit  (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
 ->  Sort  (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
      Sort Key: id DESC
      Sort Method: top-N heapsort  Memory: 38kB
    ->  Index Scan using index1 on table_name wt  (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
         Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))

Planning Time: 0.320 ms
Execution Time: 32.498 ms

我们的数据库postgres版本是:< code>12.7 并且我们定期运行vaccuming

我们想要了解为什么其他索引没有用于我们的查询

另外考虑到我们的用例,是否有更好的方法来创建索引,以便我们可以服务查询在可接受的响应时间内?

We have a very large table with a total number of rows around ~4 billion and everyday throughput is around ~20-25 million.

Below is the Sample Table definition.

table_name (
   id bigint,
   user_id bigint,
   status  smallint,
   date    timestamp,
   .
   .
   some more columns
);

NOTE: status is an enum and can have 6-7 possible values.

Below is the query that we want to optimise:

SELECT * 
FROM table_name 
WHERE user_id = $user_id 
AND status in (1, 2, 3, 6, 7, 10) 
ORDER BY date 
DESC LIMIT 20;

Initially, we were having index1 on table_name (user_id, status).

Since this index was not giving optimal performance. We thought of including the date in the index as well.

Now, we have tried creating a bunch of different indexes on the table, but the explain plan always picks up the initial index i.e: index1.

Below are the indexes that we tried:

index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));

Below is the explain analyse output:

Limit  (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
 ->  Sort  (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
      Sort Key: id DESC
      Sort Method: top-N heapsort  Memory: 38kB
    ->  Index Scan using index1 on table_name wt  (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
         Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))

Planning Time: 0.320 ms
Execution Time: 32.498 ms

Our database postgres version is: 12.7 and we run vaccuming regularly.

We want to understand why other indexes are not being used for our query.

Also given our use case, Can there be a better way of creating index so that we can serve the query in acceptable response time?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

呢古 2025-01-20 01:44:47
CREATE INDEX table_name_desc_index ON table_name (userid,_date DESC NULLS LAST);

然后尝试以下操作:

SELECT *
FROM table_name
inner join (values (1),(2),(3),(6),(7),(10)) val(v) ON (table_name.status = v )
WHERE user_id = $user_id
ORDER BY date
DESC LIMIT 20;
CREATE INDEX table_name_desc_index ON table_name (userid,_date DESC NULLS LAST);

Then try following:

SELECT *
FROM table_name
inner join (values (1),(2),(3),(6),(7),(10)) val(v) ON (table_name.status = v )
WHERE user_id = $user_id
ORDER BY date
DESC LIMIT 20;
ぃ弥猫深巷。 2025-01-20 01:44:47

您选择的是 *,因此您将无法获得仅索引扫描,因为 * 拖动的列多于索引中的列。您展示的其他索引(我可以看到)的唯一优点是启用仅索引扫描,因此如果这不起作用,那么不选择使用这些索引也就不足为奇了。您可以通过将 * 更改为仅出现在索引中的列来测试这个理论,看看会发生什么。

关于你的一个索引:

(user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

这个索引似乎毫无意义。 WHERE 子句的好处是它将“status”的不同限定值减少为单个值(“true”)。但是,将“状态”放入索引主体中只会再次将它们分解。更好的索引是:

(user_id, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

这个索引可以跳转到特定 user_id 的末尾,向后扫描(以满足 order by date desc),并在找到 20 行后停止。当你拥有闯入者的“地位”时,它就无法这样做。

You are selecting *, so you will not be able to get an index-only scan due to the * dragging in more columns than are in the index. The only advantage of those other indexes you showed (that I can see) would be to enable index-only scans, so if that cannot work, it is not surprising those indexes are not selected for use. You could test this theory by changing the * to just the columns appearing in the index to see what happens.

About one of your indexes:

(user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This index seems rather pointless. The benefit of the WHERE clause is that it reduces the disparate qualifying values of "status" down to a single value ('true'). But then putting "status" into the index body just breaks them back up again. The better index would be:

(user_id, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This one can jump to the end of the specific user_id, scan backwards (to fulfill the order by date desc) and stop once it finds 20 rows. When you have "status" as an interloper, it prevents it from doing that.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文