高性能聚合 where 子句

发布于 2024-10-20 19:38:50 字数 605 浏览 2 评论 0原文

我正在尝试查找自最后一个用户创建以来三个月内创建的用户数量。全部按州分组。

这是一个有效的查询：

select count(u.id) as numberOfUsers,
s.state
from users u
join states s on u.state_id = s.id
where u.creationdate > (
select max(u2.creationdate)
from users u2
where u2.state_id = s.id
) - interval '3 months'
group by s.state

但是，它需要 100 秒。有人能给我一个性能更好的吗？

我希望这能起作用：

select count(u.id) as numberOfUsers,
s.state, max(u2.creationdate) as lastCreated
from users u
join states s on u.state_id = s.id
where u.creationdate > lastCreated - interval '3 months'
group by s.state

原文

I'm trying to find the number of users that were created in the three months since the last user was created. All grouped by state.

Here's a query that works:

select count(u.id) as numberOfUsers,
s.state
from users u
join states s on u.state_id = s.id
where u.creationdate > (
select max(u2.creationdate)
from users u2
where u2.state_id = s.id
) - interval '3 months'
group by s.state

However, it takes 100 seconds. Can someone get me a more performant one?

I wish that this worked:

select count(u.id) as numberOfUsers,
s.state, max(u2.creationdate) as lastCreated
from users u
join states s on u.state_id = s.id
where u.creationdate > lastCreated - interval '3 months'
group by s.state

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

总攻大人 2024-10-27 19:38:50

由于只进行一次扫描，这可能会表现得更好：

select count(*) as numberofusers,
       state
from ( select id, state_id, creationdate,
              max(creationdate) over (partition by state_id) - '3 months'::interval as cutoff
       from users
     ) x
     join states on states.id = x.state_id
where creationdate > cutoff
group by state

但是，在进行初始窗口聚合时，它将消耗大量工作内存。

嗯，也许更像是：

with cutoffs as (
  select id, state,
         (select max(creationdate)
          from users
          where users.state_id = states.id) - '3 months'::interval as cutoff
  from states)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

这是试图逗弄 PostgreSQL 进行正确的分区扫描，但它并不是很理想。它仍然进行全表扫描，但至少只有一次。迭代 CTE 的输出并在循环内发出外部查询结果的集合返回函数可能效果最好，因为它将能够为每个状态使用创建日期索引。

This might perform better due to only doing one scan:

select count(*) as numberofusers,
       state
from ( select id, state_id, creationdate,
              max(creationdate) over (partition by state_id) - '3 months'::interval as cutoff
       from users
     ) x
     join states on states.id = x.state_id
where creationdate > cutoff
group by state

However, it will chew through a lot of work memory doing the initial windowing aggregation.

Hmm, maybe something more like:

with cutoffs as (
  select id, state,
         (select max(creationdate)
          from users
          where users.state_id = states.id) - '3 months'::interval as cutoff
  from states)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

This is trying to tease PostgreSQL into doing a properly partitioned scan, but it's not really ideal. It still does a full table scan, but at least only one. A set-returning function that iterated through the output of the CTE and emitted the result of the outer query inside the loop would probably work best, since that would be able to utilise a creationdate index for each state.

回复收藏 0 原文

迎风吟唱 2024-10-27 19:38:50

只是出于兴趣，以下查询的执行情况如何？我对 Postgresql 如何处理最里面的查询（状态表+标量子查询）特别感兴趣。

必须有用户的复合索引（state_id，creation_date）才能使其工作。

select s2.id
      ,s2.state
      ,(select count(*) 
          from users u 
         where u.state_id     = s2.id
           and u.creationdate > s2.max_date) as numberOfUsers
  from (select s.id
              ,s.state
              ,(select max(u.creationdate) - interval '3 months'
                  from users u
                 where u.state_id = s.id) as max_date
         from states s
       ) s2;

编辑这是为针对 3 个状态的 100,000 个用户行的查询生成的计划：

 Seq Scan on states s (actual time=4.033..13.949 rows=3 loops=1)
   Buffers: shared hit=1743
   SubPlan 3
     ->  Aggregate (actual time=4.636..4.636 rows=1 loops=3)
           Buffers: shared hit=1742
           InitPlan 2 (returns $2)
             ->  Result (actual time=0.028..0.028 rows=1 loops=3)
                   Buffers: shared hit=12
                   InitPlan 1 (returns $1)
                     ->  Limit (actual time=0.022..0.022 rows=1 loops=3)
                           Buffers: shared hit=12
                           ->  Index Scan Backward using users_state_id_creationdate_idx on users u (actual time=0.019..0.019 rows=1 loops=3)
                                 Index Cond: ((state_id = $0) AND (creationdate IS NOT NULL))
                                 Buffers: shared hit=12
           ->  Bitmap Heap Scan on users u (actual time=1.095..3.693 rows=8425 loops=3)
                 Recheck Cond: ((state_id = $0) AND (creationdate > $2))
                 Buffers: shared hit=1730
                 ->  Bitmap Index Scan on users_state_id_creationdate_idx (actual time=1.017..1.017 rows=8425 loops=3)
                       Index Cond: ((state_id = $0) AND (creationdate > $2))
                       Buffers: shared hit=107
 Total runtime: 14.017 ms

Just out of interest, how does the following query perform? I'm particularly interested in how Postgresql processes the innermost query (states table + scalar sub query).

There has to a compound index on users(state_id, creation_date) for this to work.

select s2.id
      ,s2.state
      ,(select count(*) 
          from users u 
         where u.state_id     = s2.id
           and u.creationdate > s2.max_date) as numberOfUsers
  from (select s.id
              ,s.state
              ,(select max(u.creationdate) - interval '3 months'
                  from users u
                 where u.state_id = s.id) as max_date
         from states s
       ) s2;

edit this is the plan produced for that query with 100,000 user rows against 3 states:

 Seq Scan on states s (actual time=4.033..13.949 rows=3 loops=1)
   Buffers: shared hit=1743
   SubPlan 3
     ->  Aggregate (actual time=4.636..4.636 rows=1 loops=3)
           Buffers: shared hit=1742
           InitPlan 2 (returns $2)
             ->  Result (actual time=0.028..0.028 rows=1 loops=3)
                   Buffers: shared hit=12
                   InitPlan 1 (returns $1)
                     ->  Limit (actual time=0.022..0.022 rows=1 loops=3)
                           Buffers: shared hit=12
                           ->  Index Scan Backward using users_state_id_creationdate_idx on users u (actual time=0.019..0.019 rows=1 loops=3)
                                 Index Cond: ((state_id = $0) AND (creationdate IS NOT NULL))
                                 Buffers: shared hit=12
           ->  Bitmap Heap Scan on users u (actual time=1.095..3.693 rows=8425 loops=3)
                 Recheck Cond: ((state_id = $0) AND (creationdate > $2))
                 Buffers: shared hit=1730
                 ->  Bitmap Index Scan on users_state_id_creationdate_idx (actual time=1.017..1.017 rows=8425 loops=3)
                       Index Cond: ((state_id = $0) AND (creationdate > $2))
                       Buffers: shared hit=107
 Total runtime: 14.017 ms

回复收藏 0 原文

挽心 2024-10-27 19:38:50

这是我用来将时间减少到 82 毫秒的查询：

with cutoffs as (
  select max(u.creationdate) as cuttoff, s.id, s.state,
          from users u
  join states s on u.state_id = s.id
group by s.state, s.id)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

谢谢 araqnid。

This is the query that I used to reduce the time to 82ms:

with cutoffs as (
  select max(u.creationdate) as cuttoff, s.id, s.state,
          from users u
  join states s on u.state_id = s.id
group by s.state, s.id)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

Thank you to araqnid.

回复收藏 0 原文

撩动你心 2024-10-27 19:38:50

您是否已确定查询的哪一部分速度较慢？可以添加索引吗？我不是 Postgres 专家，但我怀疑如果 users 没有在 users.creationdate 上建立索引，则 MAX() 函数将必须执行全表扫描。嗯，无论如何它可能必须做一件事......

也就是说，这里什么也没有！

SELECT u.numUsers, s.state FROM 
(SELECT count(id) as numUsers, state_id 
 FROM users
 WHERE creationdate > (MAX(creationdate) - interval '3 Months'
 GROUP BY state_id) u 
left join states s on u.state_id = s.state_id

Have you determined which part of the query is slow? Can you add indexing? I'm no Postgres guru, but I suspect that if users is not indexed on users.creationdate, the MAX() function will have to do a full table scan. Hmmm, it may have to do one anyway...

That said, here goes nothing!

SELECT u.numUsers, s.state FROM 
(SELECT count(id) as numUsers, state_id 
 FROM users
 WHERE creationdate > (MAX(creationdate) - interval '3 Months'
 GROUP BY state_id) u 
left join states s on u.state_id = s.state_id

回复收藏 0 原文

~没有更多了~