SQL: Avoiding an external merge from GROUP BY in Postgres

I have a fairly simple query which gets the top ten scores from a large table (just under 10 million rows) and returns them in descending order. The scores are sums aggregated with a GROUP BY clause, and the grouping seems especially costly, using Sort Method: external merge Disk: 190080kB.

Is there some way I can speed this up? I already have indices on user.test_id and user.score (descending). I'd prefer not to change work_mem, as I have limited control over our Postgres settings.
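
Side note: even without access to the server configuration, work_mem can usually be raised for a single session or transaction, if the spill to disk can't be avoided otherwise. A minimal sketch, assuming the role is allowed to run SET (the 256MB value is illustrative):

set work_mem = '256MB';            -- session-wide; reverts on disconnect (value illustrative)
begin;
set local work_mem = '256MB';      -- or scoped to this transaction only
-- ... run the query here ...
commit;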

Query:

select
    (select test from test where top_scores.test_id = test.id),
    (select type from test where top_scores.test_id = test.id),
    sum_score
from (select sum(score) as sum_score,
             test_id
      from user
      group by test_id
      order by sum_score desc
      limit 10
     ) top_scores

Query Plan:

Subquery Scan on top_scores  (cost=1412662.62..1412831.69 rows=10 width=16) (actual time=164098.107..164098.714 rows=10 loops=1)
  ->  Limit  (cost=1412662.62..1412662.64 rows=10 width=16) (actual time=164098.042..164098.144 rows=10 loops=1)
        ->  Sort  (cost=1412662.62..1419366.96 rows=2681736 width=16) (actual time=164098.033..164098.067 rows=10 loops=1)
              Sort Key: (sum(user.score)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              ->  GroupAggregate  (cost=1271799.65..1354711.27 rows=2681736 width=16) (actual time=72815.313..152605.093 rows=2499234 loops=1)
                    Group Key: user.test_id
                    ->  Sort  (cost=1271799.65..1290497.74 rows=7479234 width=16) (actual time=72815.273..107823.507 rows=7479234 loops=1)
                          Sort Key: user.test_id
                          Sort Method: external merge  Disk: 190080kB
                          ->  Seq Scan on user  (cost=0.00..162238.34 rows=7479234 width=16) (actual time=0.009..33795.669 rows=7479234 loops=1)
  SubPlan 1
    ->  Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=14) (actual time=0.012..0.016 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
  SubPlan 2
    ->  Index Scan using test_id_idx on test test_1  (cost=0.43..8.45 rows=1 width=3) (actual time=0.006..0.010 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
Planning time: 0.724 ms
Execution time: 164135.458 ms
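
For reference, a plan with timings like this comes from EXPLAIN ANALYZE; adding the BUFFERS option also reports temp blocks read and written, which quantify the on-disk spill of the external merge. A sketch against the inner aggregate alone:

explain (analyze, buffers)
select sum(score) as sum_score, test_id
from "user"                        -- quoted here because user is a reserved word in Postgres
group by test_id
order by sum_score desc
limit 10;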

After the suggestions in @jjanes's answer, I tried creating the following indices:

create index user_score_test_id_idx on user (score, test_id);
create index user_test_id_score_idx on user (test_id, score);
create index user_test_id_score_desc_idx on user (test_id, score desc nulls last);
create index user_score_desc_test_id_idx on user (score desc nulls last, test_id);

and running VACUUM FULL on the user table.

This had no discernible effect on execution time, and the resulting query plan was exactly the same as without them (I put the plans through a diff checker and the only differences were the timings).

Edit: apparently VACUUM FULL wasn't what I wanted; a plain VACUUM was all that was needed.
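
(Worth noting: plain VACUUM updates the visibility map that index-only scans depend on, while VACUUM FULL rewrites the table under an exclusive lock. A minimal sketch:)

vacuum "user";          -- updates the visibility map; no exclusive lock
-- vacuum full "user";  -- rewrites the whole table under an exclusive lock; not needed here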

Comments (2)

夏花。依旧 2025-01-28 00:00:37

I wouldn't expect separate indexes on test_id and score to help here. But a multicolumn index on (test_id, score) should be usable for an index-only scan and so avoid the sort. If it doesn't help immediately, then VACUUM the table to get the visibility map bits set.

Also, your hardware appears to be spectacularly bad, or maybe just extremely overloaded.
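
A minimal sketch of verifying this suggestion (the index name is illustrative): after creating the index and vacuuming, the plan should show an Index Only Scan with a low Heap Fetches count.

create index if not exists user_test_id_score_idx on "user" (test_id, score);
vacuum "user";                     -- sets the visibility-map bits the index-only scan relies on
explain (analyze)
select sum(score) as sum_score, test_id
from "user"
group by test_id;
-- expect: Index Only Scan using user_test_id_score_idx ... Heap Fetches: 0 (or close to it)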

沧桑㈠ 2025-01-28 00:00:37

Try:

select
    test, type, sum_score
from
    test
join
    (select
        sum(score) as sum_score,
        test_id
     from
        user
     group by
        test_id
     order by
        sum_score desc
     limit 10
    ) top_scores
on
    test.id = top_scores.test_id
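
(A note on this rewrite: it collapses the two correlated subqueries, SubPlan 1 and SubPlan 2 in the plan above, into a single join. The expensive sort feeding the GroupAggregate comes from the inner aggregate, though, so on its own this is unlikely to remove the external merge.)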
