SQL: Avoiding an external merge from GROUP BY in Postgres

I have a fairly simple query which gets the top ten scores from a large table (just under 10 million rows) and returns them in descending order. The scores are sums aggregated with a GROUP BY clause, and the grouping seems especially costly, using Sort Method: external merge Disk: 190080kB.

Is there some way I can speed this up? I already have indices on user.test_id and user.score (descending). I'd prefer not to change work_mem, as I have limited control over our Postgres settings.
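
Side note: even without access to the server configuration, work_mem can usually be raised for a single session or transaction, if the spill to disk can't be avoided otherwise. A minimal sketch, assuming the role is allowed to run SET (the 256MB value is illustrative):

set work_mem = '256MB';            -- session-wide; reverts on disconnect (value illustrative)
begin;
set local work_mem = '256MB';      -- or scoped to this transaction only
-- ... run the query here ...
commit;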

Query:

select
    (select test from test where top_scores.test_id = test.id),
    (select type from test where top_scores.test_id = test.id),
    sum_score
from (select sum(score) as sum_score,
             test_id
      from user
      group by test_id
      order by sum_score desc
      limit 10
     ) top_scores

Query Plan:

Subquery Scan on top_scores  (cost=1412662.62..1412831.69 rows=10 width=16) (actual time=164098.107..164098.714 rows=10 loops=1)
  ->  Limit  (cost=1412662.62..1412662.64 rows=10 width=16) (actual time=164098.042..164098.144 rows=10 loops=1)
        ->  Sort  (cost=1412662.62..1419366.96 rows=2681736 width=16) (actual time=164098.033..164098.067 rows=10 loops=1)
              Sort Key: (sum(user.score)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              ->  GroupAggregate  (cost=1271799.65..1354711.27 rows=2681736 width=16) (actual time=72815.313..152605.093 rows=2499234 loops=1)
                    Group Key: user.test_id
                    ->  Sort  (cost=1271799.65..1290497.74 rows=7479234 width=16) (actual time=72815.273..107823.507 rows=7479234 loops=1)
                          Sort Key: user.test_id
                          Sort Method: external merge  Disk: 190080kB
                          ->  Seq Scan on user  (cost=0.00..162238.34 rows=7479234 width=16) (actual time=0.009..33795.669 rows=7479234 loops=1)
  SubPlan 1
    ->  Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=14) (actual time=0.012..0.016 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
  SubPlan 2
    ->  Index Scan using test_id_idx on test test_1  (cost=0.43..8.45 rows=1 width=3) (actual time=0.006..0.010 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
Planning time: 0.724 ms
Execution time: 164135.458 ms
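
For reference, a plan with timings like this comes from EXPLAIN ANALYZE; adding the BUFFERS option also reports temp blocks read and written, which quantify the on-disk spill of the external merge. A sketch against the inner aggregate alone:

explain (analyze, buffers)
select sum(score) as sum_score, test_id
from "user"                        -- quoted here because user is a reserved word in Postgres
group by test_id
order by sum_score desc
limit 10;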

After the suggestions in @jjanes's answer, I tried creating the following indices:

create index user_score_test_id_idx on user (score, test_id);
create index user_test_id_score_idx on user (test_id, score);
create index user_test_id_score_desc_idx on user (test_id, score desc nulls last);
create index user_score_desc_test_id_idx on user (score desc nulls last, test_id);

and running VACUUM FULL on the user table.

This had no discernible effect on execution time, and the resulting query plan was exactly the same as without them (I put the plans through a diff checker and the only differences were the timings).

Edit: apparently VACUUM FULL wasn't what I wanted; a plain VACUUM was all that was needed.
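
(Worth noting: plain VACUUM updates the visibility map that index-only scans depend on, while VACUUM FULL rewrites the table under an exclusive lock. A minimal sketch:)

vacuum "user";          -- updates the visibility map; no exclusive lock
-- vacuum full "user";  -- rewrites the whole table under an exclusive lock; not needed here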

Comments (2)

夏花。依旧 2025-01-28 00:00:37

I wouldn't expect separate indexes on test_id and score to help here. But a multicolumn index on (test_id, score) should be usable for an index-only scan and so avoid the sort. If it doesn't help immediately, then VACUUM the table to get the visibility map bits set.

Also, your hardware appears to be spectacularly bad, or maybe just extremely overloaded.
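
A minimal sketch of verifying this suggestion (the index name is illustrative): after creating the index and vacuuming, the plan should show an Index Only Scan with a low Heap Fetches count.

create index if not exists user_test_id_score_idx on "user" (test_id, score);
vacuum "user";                     -- sets the visibility-map bits the index-only scan relies on
explain (analyze)
select sum(score) as sum_score, test_id
from "user"
group by test_id;
-- expect: Index Only Scan using user_test_id_score_idx ... Heap Fetches: 0 (or close to it)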

沧桑㈠ 2025-01-28 00:00:37

Try:

select
    test, type, sum_score
from
    test
join
    (select
        sum(score) as sum_score,
        test_id
     from
        user
     group by
        test_id
     order by
        sum_score desc
     limit 10
    ) top_scores
on
    test.id = top_scores.test_id
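
(A note on this rewrite: it collapses the two correlated subqueries, SubPlan 1 and SubPlan 2 in the plan above, into a single join. The expensive sort feeding the GroupAggregate comes from the inner aggregate, though, so on its own this is unlikely to remove the external merge.)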
