获取具有最高/最小的记录每组
如何获得每组最高/最小的记录?
这个问题的前标题是“在带有子查询的复杂查询中使用排名(@Rank := @Rank + 1) - 它会起作用吗?”因为我正在寻找使用排名的解决方案,但现在我看到了Bill 发布的解决方案要好得多。
原始问题:
我正在尝试编写一个查询,该查询将按照给定的某种定义顺序从每个组中获取最后一条记录:
SET @Rank=0;
select s.*
from (select GroupId, max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from Table
order by OrderField
) as t
group by GroupId) as t
join (
select *, @Rank := @Rank + 1 AS Rank
from Table
order by OrderField
) as s
on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField
表达式 @Rank := @Rank + 1
通常用于排名,但对我来说在 2 个子查询中使用时看起来很可疑,但仅初始化一次。会这样吗?
其次,它是否适用于多次评估的一个子查询?就像where(或having)子句中的子查询(如何编写上述内容的另一种方式):
SET @Rank=0;
select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from Table as t0
order by OrderField
) as t
where t.GroupId = table.GroupId
)
order by OrderField
提前致谢!
How can one get records with highest/smallest per group?
Former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for solution using ranks, but now I see that the solution posted by Bill is much much better.
Original question:
I'm trying to compose a query that would take last record from each group given some defined order:
SET @Rank=0;
select s.*
from (select GroupId, max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from Table
order by OrderField
) as t
group by GroupId) as t
join (
select *, @Rank := @Rank + 1 AS Rank
from Table
order by OrderField
) as s
on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField
Expression @Rank := @Rank + 1
is normally used for rank, but for me it looks suspicious when used in 2 subqueries, but initialized only once. Will it work this way?
And second, will it work with one subquery that is evaluated multiple times? Like subquery in where (or having) clause (another way how to write the above):
SET @Rank=0;
select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from Table as t0
order by OrderField
) as t
where t.GroupId = table.GroupId
)
order by OrderField
Thanks in advance!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
要获取每组
OrderField
最高的行:如果同一组中有更多具有相同 OrderField 的记录,而您恰好需要其中一个,则可能需要扩展条件:
换句话说,返回不存在具有相同
GroupId
和更大OrderField
的其他行t2
的行t1
。当t2.*
为NULL时,表示左外连接没有找到这样的匹配,因此t1
具有OrderField
中的最大值。团体。没有排名,没有子查询。如果您在
(GroupId, OrderField)
上有复合索引,那么这应该可以快速运行并通过“使用索引”优化对 t2 的访问。关于性能,请参阅我对检索每组中的最后一条记录的回答。我尝试了使用 Stack Overflow 数据转储的子查询方法和连接方法。差异非常显着:在我的测试中,join 方法的运行速度快了 278 倍。
拥有正确的索引以获得最佳结果非常重要!
关于使用 @Rank 变量的方法,它不会像您编写的那样工作,因为在查询处理第一个表后,@Rank 的值不会重置为零。我给你看一个例子。
我插入了一些虚拟数据,其中有一个额外的字段为空,除了我们知道每组最大的行之外:
我们可以显示第一组的排名增加到三,第二组的排名增加到六,并且内部查询返回这些正确:
现在运行没有连接条件的查询,以强制所有行的笛卡尔积,并且我们还获取所有列:
从上面我们可以看到每组的最大排名是正确的,但是@Rank继续在处理第二个派生表时增加到 7并更高。因此,第二个派生表中的排名永远不会与第一个派生表中的排名重叠。
您必须添加另一个派生表来强制 @Rank 在处理两个表之间重置为零(并希望优化器不会更改其评估表的顺序,或者使用 STRAIGHT_JOIN 来防止这种情况):
但是这个查询的优化很糟糕。它不能使用任何索引,它创建两个临时表,以困难的方式对它们进行排序,甚至使用连接缓冲区,因为它在连接临时表时也不能使用索引。这是来自
EXPLAIN
的示例输出:而我使用左外连接的解决方案优化得更好。它不使用临时表,甚至报告“使用索引”,这意味着它可以仅使用索引来解析连接,而无需接触数据。
您可能会在博客上看到人们声称“联接使 SQL 变慢”,但这是无稽之谈。优化不当会导致 SQL 变慢。
To get the row with the highest
OrderField
per group:If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:
In other words, return the row
t1
for which no other rowt2
exists with the sameGroupId
and a greaterOrderField
. Whent2.*
is NULL, it means the left outer join found no such match, and thereforet1
has the greatest value ofOrderField
in the group.No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on
(GroupId, OrderField)
.Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.
It's important that you have the right index to get the best results!
Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.
I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:
We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:
Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:
We can see from the above that the max rank per group is correct, but then the @Rank continues to increase as it processes the second derived table, to 7 and on higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.
You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):
But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from
EXPLAIN
:Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports
"Using index"
which means it can resolve the join using only the index, without touching the data.You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.
如果您想要执行比仅第一个结果更复杂的逻辑,例如仅获取第二个结果或具有特定约束的第一个结果:
If you want to do more sophisticated logic than only the first result, like getting only the 2nd result or the first result with a certain constraint: