Retrieving the last record in each group - MySQL

Posted on 2024-08-03 00:36:08, 727 words, 10 views, 0 comments, original post

There is a table messages that contains data as shown below:

Id   Name   Other_Columns
-------------------------
1    A       A_data_1
2    A       A_data_2
3    A       A_data_3
4    B       B_data_1
5    B       B_data_2
6    C       C_data_1

If I run a query select * from messages group by name, I will get the result as:

1    A       A_data_1
4    B       B_data_1
6    C       C_data_1

What query will return the following result?

3    A       A_data_3
5    B       B_data_2
6    C       C_data_1

That is, the last record in each group should be returned.

At present, this is the query that I use:

SELECT
  *
FROM (SELECT
  *
FROM messages
ORDER BY id DESC) AS x
GROUP BY name

But this looks highly inefficient. Any other ways to achieve the same result?

Comments (30)

洛阳烟雨空心柳 2024-08-10 00:36:08

MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
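
To sanity-check the window-function query on the sample data from the question, here is a minimal sketch that runs it through Python's built-in sqlite3 module. SQLite (3.25 or later, for window functions) merely stands in for MySQL 8.0 here, and the lowercase column names are an assumption made for the demo.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Number the rows of each name from newest (rn = 1) to oldest, keep the newest.
last_rows = conn.execute(
    """
    WITH ranked_messages AS (
      SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
      FROM messages AS m
    )
    SELECT id, name, other_columns
    FROM ranked_messages
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

for row in last_rows:
    print(row)
```

The ORDER BY in the outer query is only there to make the printed order stable; it is not part of the technique.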

This and other approaches to finding groupwise maximal rows are illustrated in the MySQL manual.

Below is the original answer I wrote for this question in 2009:


I write the solution this way:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
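
Why `m2.id IS NULL` selects the last row is easier to see from the unfiltered join: every row is paired with each later row of the same name, and only the row with the highest id in its group finds no partner. A small sketch on the question's sample data, using Python's sqlite3 in place of MySQL (an assumption made purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "C")],
)

# The join without the WHERE filter: each m1 row next to every later m2 row
# of the same name. Rows with no later sibling get NULL (None) for m2.id.
pairs = conn.execute(
    """
    SELECT m1.id, m2.id
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    ORDER BY m1.id, m2.id
    """
).fetchall()

# Keeping only the unmatched rows is what WHERE m2.id IS NULL does.
last_ids = [m1_id for m1_id, m2_id in pairs if m2_id is None]

print(pairs)
print(last_ids)
```

The number of intermediate pairs grows quadratically with the group size, which is why the cost of this form depends so heavily on how large the groups are.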

Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.

For example, I have a copy of the Stack Overflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my MacBook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.

您的好友蓝忘机已上羡 2024-08-10 00:36:08

Update (2017-03-31): MySQL 5.7.5 enabled the ONLY_FULL_GROUP_BY SQL mode by default, so non-deterministic GROUP BY queries are now rejected. The GROUP BY implementation was also updated, and the solutions below might no longer work as expected even with that mode disabled. Check before relying on them.

Bill Karwin's solution above works fine when the item count within groups is rather small, but the query performs badly when the groups are rather large, since the solution requires about n*n/2 + n/2 IS NULL comparisons alone.
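
To get a feel for that estimate, the sketch below plugs in the table size quoted in this answer (18,684,446 rows in 1,182 groups). It assumes, purely for illustration, that all groups are the same size; the real distribution is not given.

```python
def pair_comparisons(n: int) -> int:
    """The answer's estimate for one group of n rows: n*n/2 + n/2."""
    return n * (n + 1) // 2


total_rows = 18_684_446
groups = 1_182

# Assumption: every group has the same number of rows.
avg_group = total_rows // groups
estimate = groups * pair_comparisons(avg_group)

print(f"{avg_group} rows per group, about {estimate:.2e} comparisons in total")
```

Under that assumption the total lands above 10^11 comparisons, which fits the observation that the query ran for hours.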

I ran my tests on an InnoDB table of 18684446 rows with 1182 groups. The table contains test results for functional tests and has (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.

Bill's solution has already been running for several hours on my Dell E4310 and I do not know when it is going to finish, even though it operates on a covering index (hence "Using index" in EXPLAIN).

I have a couple of other solutions that are based on the same ideas:

  • if the underlying index is a BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, which is the first one for each group_id if we walk through the index in descending order;
  • if we read the values which are covered by an index, the values are read in the order of the index;
  • each index implicitly contains the primary key columns appended to it (that is, the primary key is in the covering index). In the solutions below I operate directly on the primary key; in your case, you will just need to add the primary key columns to the result;
  • in many cases it is much cheaper to collect the required row ids in the required order in a subquery and join the result of the subquery on the id. Since for each row in the subquery result MySQL will need a single fetch based on the primary key, the subquery will be put first in the join and the rows will be output in the order of the ids in the subquery (if we omit an explicit ORDER BY for the join).

"3 ways MySQL uses indexes" is a great article for understanding some of the details.

Solution 1

This one is incredibly fast; it takes about 0.8 seconds on my 18M+ rows:

SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;

If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns:

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

This one takes about 1.2 seconds on my data.
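
Note that the `GROUP BY test_id DESC` shorthand is a MySQL extension that was deprecated in 5.7 and removed in MySQL 8.0.13. On newer servers the same rows can be requested with a plain GROUP BY plus an explicit ORDER BY. Below is a minimal sketch with made-up data, run through Python's sqlite3 as a stand-in for MySQL; it only checks the result, not whether the optimizer still avoids a sort on a table of 18M rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE testresults ("
    " test_id INTEGER, request_id INTEGER,"
    " PRIMARY KEY (test_id, request_id))"
)
# Hypothetical sample rows, not taken from the answer.
conn.executemany(
    "INSERT INTO testresults VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 20), (2, 25), (2, 21), (3, 30)],
)

# Portable spelling: group first, then order the groups explicitly.
latest = conn.execute(
    """
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id
    ORDER BY test_id DESC
    """
).fetchall()

print(latest)
```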

Solution 2

Here is another solution that takes about 19 seconds for my table:

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

It returns tests in descending order as well. It is much slower since it does a full index scan, but it is here to give you an idea of how to output the N max rows for each group.

The disadvantage of the query is that its result cannot be cached by the query cache.
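
On servers with window functions (MySQL 8.0 and later), the "N max rows for each group" idea can also be written without user variables. The sketch below is a swapped-in technique, not the author's query: it uses ROW_NUMBER() and runs on Python's sqlite3 (SQLite 3.25+) with hypothetical sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL 8.0 here
conn.execute(
    "CREATE TABLE testresults ("
    " test_id INTEGER, request_id INTEGER,"
    " PRIMARY KEY (test_id, request_id))"
)
# Hypothetical sample rows, not taken from the answer.
conn.executemany(
    "INSERT INTO testresults VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 20), (2, 25), (3, 30)],
)

N = 2  # how many of the largest request_ids to keep per test_id

top_n = conn.execute(
    """
    SELECT test_id, request_id
    FROM (
      SELECT test_id, request_id,
             ROW_NUMBER() OVER (
               PARTITION BY test_id ORDER BY request_id DESC
             ) AS rn
      FROM testresults
    ) AS ranked
    WHERE rn <= ?
    ORDER BY test_id DESC, request_id DESC
    """,
    (N,),
).fetchall()

print(top_n)
```

Groups with fewer than N rows simply return all of their rows, as test_id 3 does here.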

情绪 2024-08-10 00:36:08

Use your subquery to return the correct grouping, because you're halfway there.

Try this:

select
    a.*
from
    messages a
    inner join 
        (select name, max(id) as maxid from messages group by name) as b on
        a.id = b.maxid

If it's not id you want the max of:

select
    a.*
from
    messages a
    inner join 
        (select name, max(other_col) as other_col 
         from messages group by name) as b on
        a.name = b.name
        and a.other_col = b.other_col

This way, you avoid correlated subqueries and/or ordering in your subqueries, which tend to be very slow/inefficient.
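
One thing worth checking with the second form: if the column being maximized is not unique within a group, the join returns every row that ties for the maximum. The sketch below shows this on hypothetical data, using Python's sqlite3 as a stand-in for MySQL and a numeric `other_col` (both are assumptions for the demo).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_col INTEGER)"
)
# Hypothetical rows: ids 2 and 3 tie for the largest other_col in group A.
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "A", 5), (2, "A", 7), (3, "A", 7), (4, "B", 1)],
)

rows = conn.execute(
    """
    SELECT a.id, a.name, a.other_col
    FROM messages a
    INNER JOIN (
      SELECT name, MAX(other_col) AS other_col
      FROM messages
      GROUP BY name
    ) AS b
      ON a.name = b.name AND a.other_col = b.other_col
    ORDER BY a.id
    """
).fetchall()

print(rows)  # group A appears twice because of the tie
```

If exactly one row per group is required, a tie-breaker (for example the highest id among the tied rows) has to be added.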

你好,陌生人 2024-08-10 00:36:08

I arrived at a different solution, which is to get the IDs for the last post within each group, then select from the messages table using the result from the first query as the argument for a WHERE x IN construct:

SELECT id, name, other_columns
FROM messages
WHERE id IN (
    SELECT MAX(id)
    FROM messages
    GROUP BY name
);

I don't know how this performs compared to some of the other solutions, but it worked spectacularly for my table with 3+ million rows. (4 second execution with 1200+ results)

This should work both on MySQL and SQL Server.

命比纸薄 2024-08-10 00:36:08

Solution by subquery (fiddle link):

select * from messages where id in
(select max(id) from messages group by Name)

Solution by join condition (fiddle link):

select m1.* from messages m1 
left outer join messages m2 
on ( m1.id<m2.id and m1.name=m2.name )
where m2.id is null

The reason for this post is only to provide the fiddle links.
The same SQL is already provided in other answers.

来世叙缘 2024-08-10 00:36:08

We will look at how to get the last record in each group of records in MySQL. For example, suppose you have this result set of posts:

id category_id post_title
1 1 Title 1
2 1 Title 2
3 1 Title 3
4 2 Title 4
5 2 Title 5
6 3 Title 6

I want to get the last post in each category, which are Title 3, Title 5 and Title 6. To group the posts by category, you would use the MySQL GROUP BY keyword.

select * from posts group by category_id

But the results we get back from this query are:

id category_id post_title
1 1 Title 1
4 2 Title 4
6 3 Title 6

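With ONLY_FULL_GROUP_BY disabled, this GROUP BY usually returns the first record of each group, but MySQL does not guarantee which row it picks. To get the last record reliably, select by the maximum id instead: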

SELECT id, category_id, post_title
FROM posts
WHERE id IN (
    SELECT MAX(id)
    FROM posts
    GROUP BY category_id );

This will return the posts with the highest IDs in each group.

id category_id post_title
3 1 Title 3
5 2 Title 5
6 3 Title 6

孤云独去闲 2024-08-10 00:36:08

A fairly fast approach is as follows.

SELECT * 
FROM messages a
WHERE Id = (SELECT MAX(Id) FROM messages WHERE a.Name = Name)

Result

Id  Name    Other_Columns
3   A   A_data_3
5   B   B_data_2
6   C   C_data_1

青春有你 2024-08-10 00:36:08

Here are two suggestions. First, if MySQL supports ROW_NUMBER(), it's very simple:

WITH Ranked AS (
  SELECT Id, Name, OtherColumns,
    ROW_NUMBER() OVER (
      PARTITION BY Name
      ORDER BY Id DESC
    ) AS rk
  FROM messages
)
  SELECT Id, Name, OtherColumns
  FROM Ranked
  WHERE rk = 1;

I'm assuming that by "last" you mean last in Id order. If not, change the ORDER BY clause of the ROW_NUMBER() window accordingly.

Second, if ROW_NUMBER() isn't available, this is often a good way to proceed:

SELECT
  Id, Name, OtherColumns
FROM messages
WHERE NOT EXISTS (
  SELECT * FROM messages as M2
  WHERE M2.Name = messages.Name
  AND M2.Id > messages.Id
)

In other words, select messages where there is no later-Id message with the same Name.
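
As a quick check of that reading, the sketch below runs the NOT EXISTS form on the sample rows from the question. Python's sqlite3 stands in for MySQL, and the table is created with the column names used in this answer (Id, Name, OtherColumns); both choices are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE messages (Id INTEGER PRIMARY KEY, Name TEXT, OtherColumns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Keep a message only if no message with the same Name has a larger Id.
last_rows = conn.execute(
    """
    SELECT Id, Name, OtherColumns
    FROM messages
    WHERE NOT EXISTS (
      SELECT * FROM messages AS M2
      WHERE M2.Name = messages.Name
      AND M2.Id > messages.Id
    )
    ORDER BY Id
    """
).fetchall()

print(last_rows)
```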

蹲墙角沉默 2024-08-10 00:36:08

Clearly there are lots of different ways of getting the same results; your question seems to be which of them is an efficient way of getting the last result in each group in MySQL. If you are working with huge amounts of data and using InnoDB, then even with the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc) there might not be an efficient way of doing this.

We sometimes need to do this with tables with even more than 60 million rows.

For these examples I will use data with only about 1.5 million rows where the queries would need to find results for all groups in the data. In our actual cases we would often need to return back data from about 2,000 groups (which hypothetically would not require examining very much of the data).

I will use the following tables:

CREATE TABLE temperature(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT, 
  groupID INT UNSIGNED NOT NULL, 
  recordedTimestamp TIMESTAMP NOT NULL, 
  recordedValue INT NOT NULL,
  INDEX groupIndex(groupID, recordedTimestamp), 
  PRIMARY KEY (id)
);

CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id)); 

The temperature table is populated with about 1.5 million random records, and with 100 different groups.
The selected_group table is populated with those 100 groups (in our cases this would normally be less than 20% of all the groups).

As this data is random, multiple rows can have the same recordedTimestamp. What we want is a list of all the selected groups in groupID order with the last recordedTimestamp for each group; if a group has more than one row with that timestamp, we want the one with the highest id.
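
To make the tie-breaking rule concrete, here is a minimal sketch on five hypothetical rows, run through Python's sqlite3 as a stand-in for MySQL. Timestamps are stored as ISO-8601 text (an assumption for the demo), and the query is a simplified correlated-subquery form without the selected_group table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE temperature ("
    " id INTEGER PRIMARY KEY, groupID INTEGER,"
    " recordedTimestamp TEXT, recordedValue INTEGER)"
)
# Hypothetical rows: ids 2 and 3 share the newest timestamp of group 1;
# in group 2 the highest id (5) is NOT the newest record.
conn.executemany(
    "INSERT INTO temperature VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01 00:00:00", 10),
        (2, 1, "2024-01-02 00:00:00", 11),
        (3, 1, "2024-01-02 00:00:00", 12),
        (4, 2, "2024-01-03 00:00:00", 20),
        (5, 2, "2024-01-01 00:00:00", 21),
    ],
)

# Newest timestamp wins; among equal timestamps the highest id wins.
latest = conn.execute(
    """
    SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
    FROM temperature t1
    WHERE t1.id = (
      SELECT t2.id
      FROM temperature t2
      WHERE t2.groupID = t1.groupID
      ORDER BY t2.recordedTimestamp DESC, t2.id DESC
      LIMIT 1
    )
    ORDER BY t1.groupID
    """
).fetchall()

print(latest)
```

Group 2 shows why a plain MAX(id) is not enough for this requirement: the row with the highest id is older than row 4.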

If hypothetically MySQL had a last() function which returned values from the last row in a special ORDER BY clause then we could simply do:

SELECT 
  last(t1.id) AS id, 
  t1.groupID, 
  last(t1.recordedTimestamp) AS recordedTimestamp, 
  last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;

which would only need to examine a few hundred rows in this case, as it doesn't use any of the normal GROUP BY functions. This would execute in 0 seconds and hence be highly efficient.
Note that normally in MySQL an ORDER BY clause follows the GROUP BY clause. Here, however, the ORDER BY clause determines the order for the last() function; if it came after the GROUP BY, it would be ordering the groups. If no GROUP BY clause is present, the last values will be the same in all of the returned rows.

However, MySQL does not have this, so let's look at different ideas based on what it does have and show that none of these are efficient.

Example 1

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT t2.id
  FROM temperature t2 
  WHERE t2.groupID = g.id
  ORDER BY t2.recordedTimestamp DESC, t2.id DESC
  LIMIT 1
);

This examined 3,009,254 rows and took ~0.859 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 2

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM temperature t1
INNER JOIN ( 
  SELECT max(t2.id) AS id   
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
) t5 ON t5.id = t1.id;

This examined 1,505,331 rows and took ~1.25 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 3

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM temperature t1
WHERE t1.id IN ( 
  SELECT max(t2.id) AS id   
  FROM temperature t2
  INNER JOIN (
    SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
    FROM selected_group g
    INNER JOIN temperature t3 ON t3.groupID = g.id
    GROUP BY t3.groupID
  ) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
  GROUP BY t2.groupID
)
ORDER BY t1.groupID;

This examined 3,009,685 rows and took ~1.95 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 4

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
  SELECT max(t2.id)
  FROM temperature t2 
  WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
      SELECT max(t3.recordedTimestamp)
      FROM temperature t3 
      WHERE t3.groupID = g.id
    )
);

This examined 6,137,810 rows and took ~2.2 seconds on 5.7.21 and slightly longer on 8.0.4-rc

Example 5

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
  SELECT 
    t2.id, 
    t2.groupID, 
    t2.recordedTimestamp, 
    t2.recordedValue, 
    row_number() OVER (
      PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
    ) AS rowNumber
  FROM selected_group g 
  INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;

This examined 6,017,808 rows and took ~4.2 seconds on 8.0.4-rc

Example 6

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM (
  SELECT 
    last_value(t2.id) OVER w AS id, 
    t2.groupID, 
    last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp, 
    last_value(t2.recordedValue) OVER w AS recordedValue
  FROM selected_group g
  INNER JOIN temperature t2 ON t2.groupID = g.id
  WINDOW w AS (
    PARTITION BY t2.groupID 
    ORDER BY t2.recordedTimestamp, t2.id 
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
) t1
GROUP BY t1.groupID;

This examined 6,017,908 rows and took ~17.5 seconds on 8.0.4-rc

Example 7

SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue 
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2 
  ON t2.groupID = g.id 
  AND (
    t2.recordedTimestamp > t1.recordedTimestamp 
    OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
  )
WHERE t2.id IS NULL
ORDER BY t1.groupID;

This one was taking forever so I had to kill it.

明月松间行 2024-08-10 00:36:08

Here is another way to get the last related record: use GROUP_CONCAT with ORDER BY to build a list per group, then SUBSTRING_INDEX to pick the first entry from that list.

SELECT 
  MAX(`Id`) AS `Id`,
  `Name`,
  SUBSTRING_INDEX(
    GROUP_CONCAT(
      `Other_Columns` 
      ORDER BY `Id` DESC 
      SEPARATOR '||'
    ),
    '||',
    1
  ) Other_Columns 
FROM
  messages 
GROUP BY `Name` 

The query above groups all the Other_Columns values that share the same Name and, because of ORDER BY `Id` DESC, joins them in descending Id order with the given separator (I used || in my case). Applying SUBSTRING_INDEX to that list then picks the first entry, which belongs to the most recent row.

Fiddle Demo

在风中等你 2024-08-10 00:36:08

Hi @Vijay Dev, if your table messages has an auto-increment primary key Id, then to fetch the latest record per group based on that primary key your query should read as follows:

SELECT m1.* FROM messages m1 INNER JOIN (SELECT max(Id) as lastmsgId FROM messages GROUP BY Name) m2 ON m1.Id=m2.lastmsgId
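
The join above can be checked against the sample data from the question. The following is only a small sketch: it assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the ORDER BY is added purely to make the output order predictable.

```python
import sqlite3

# Sample data copied from the question; SQLite stands in for MySQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (Id INTEGER PRIMARY KEY, Name TEXT, Other_Columns TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
     (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1")],
)

# The derived table yields one row per Name holding the largest Id;
# joining back on Id fetches the full row for each of those ids.
rows = conn.execute(
    """
    SELECT m1.*
    FROM messages m1
    INNER JOIN (SELECT MAX(Id) AS lastmsgId FROM messages GROUP BY Name) m2
        ON m1.Id = m2.lastmsgId
    ORDER BY m1.Id
    """
).fetchall()

print(rows)
```

The result should be the rows with Id 3, 5 and 6, which is the output the question asks for.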
染墨丶若流云 2024-08-10 00:36:08

If you need the most recent or oldest record of a text column in a grouped query, and you would rather not use a subquery, you can do this...

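For example, you have a list of movies and need the number of movies in each series together with the latest movie:

id   series      name
--------------------------------------------
1    Star Wars   A New Hope
2    Star Wars   The Empire Strikes Back
3    Star Wars   Return of the Jedi

SELECT COUNT(id), series,
       -- zero-pad id so the string MAX() orders numerically, then strip the 10-character prefix
       SUBSTRING(MAX(CONCAT(LPAD(id, 10, '0'), name)), 11) AS name
FROM Movies
GROUP BY series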

This returns...

count   series      name
---------------------------------------
3       Star Wars   Return of the Jedi

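MAX returns the highest value in each group, so by prefixing the name with the id, the maximum is the entry with the highest id, that is, the newest record; then just strip off the id prefix for your final result. Because the comparison is between strings, the id has to be zero-padded to a fixed width (for example with LPAD), otherwise an id of 10 would sort before an id of 9.

This can be more efficient than using a subquery.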

So for the given example:

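SELECT MAX(Id), Name,
       -- zero-pad Id so the string MAX() orders numerically, then strip the 10-character prefix
       SUBSTRING(MAX(CONCAT(LPAD(Id, 10, '0'), Other_Columns)), 11) AS Other_Columns
FROM messages
GROUP BY Name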

Happy coding, and "May The Force Be With You" :)
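
To see why the width of the id prefix matters for this trick, here is a small sketch. It assumes Python's built-in sqlite3 module instead of MySQL, so || and printf stand in for CONCAT and LPAD, and the ids 9 and 10 are hypothetical values chosen only because they differ in digit count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (id INTEGER PRIMARY KEY, series TEXT, name TEXT)")
# Hypothetical ids 9 and 10, chosen so that the ids have different lengths.
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [(9, "Star Wars", "The Empire Strikes Back"), (10, "Star Wars", "Return of the Jedi")],
)

# Unpadded prefix: as strings, '9The Empire...' sorts after '10Return...',
# so MAX picks the row with id 9 even though id 10 is newer.
unpadded = conn.execute(
    "SELECT MAX(CAST(id AS TEXT) || name) FROM Movies GROUP BY series"
).fetchone()[0]

# Prefix padded to 10 digits: string order now matches numeric order,
# and SUBSTR(..., 11) drops the fixed-width prefix again.
padded = conn.execute(
    "SELECT SUBSTR(MAX(printf('%010d', id) || name), 11) FROM Movies GROUP BY series"
).fetchone()[0]

print(unpadded)
print(padded)
```

With a fixed-width prefix the length of the part to strip is also constant, so the query does not need to reference the non-aggregated id column at all.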

指尖上的星空 2024-08-10 00:36:08
SELECT 
  column1,
  column2 
FROM
  table_name 
WHERE id IN 
  (SELECT 
    MAX(id) 
  FROM
    table_name 
  GROUP BY column1) 
ORDER BY column1 ;
顾冷 2024-08-10 00:36:08

You can also view a demo here:

http://sqlfiddle.com/#!9/ef42b/9

FIRST SOLUTION

SELECT d1.ID,Name,City FROM Demo_User d1
INNER JOIN
(SELECT MAX(ID) AS ID FROM Demo_User GROUP By NAME) AS P ON (d1.ID=P.ID);

SECOND SOLUTION

SELECT * FROM (SELECT * FROM Demo_User ORDER BY ID DESC) AS T GROUP BY NAME ;
鞋纸虽美,但不合脚ㄋ〞 2024-08-10 00:36:08

Hi, this query might help:

SELECT 
  *
FROM 
  messages 
WHERE 
  `Id` IN (
    SELECT 
      MAX(`Id`) 
    FROM 
      messages 
    GROUP BY 
      `Name`
  ) 
ORDER BY 
   `Id` DESC
毁梦 2024-08-10 00:36:08

Try this:

SELECT jos_categories.title AS name,
       joined.catid,
       joined.title,
       joined.introtext
FROM   jos_categories
       INNER JOIN (SELECT *
                   FROM   (SELECT `title`,
                                  catid,
                                  `created`,
                                  introtext
                           FROM   `jos_content`
                           WHERE  `sectionid` = 6
                           ORDER  BY `id` DESC) AS yes
                   GROUP  BY `yes`.`catid` DESC
                   ORDER  BY `yes`.`created` DESC) AS joined
         ON( joined.catid = jos_categories.id )  
带刺的爱情 2024-08-10 00:36:08

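I have not yet tested this with a large database, but I think it could be faster than joining tables. Be aware that MySQL does not guarantee that the columns returned by * come from the row holding Max(Id), and the query is rejected when ONLY_FULL_GROUP_BY is enabled: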

SELECT *, Max(Id) FROM messages GROUP BY Name
那小子欠揍 2024-08-10 00:36:08
SELECT * FROM table_name WHERE primary_key IN (SELECT MAX(primary_key) FROM table_name GROUP BY column_name )
独留℉清风醉 2024-08-10 00:36:08

I found the best solution at https://dzone.com/articles/get-last-record-in-each-mysql-group:

select * from `data` where `id` in (select max(`id`) from `data` group by `name_id`)
与君绝 2024-08-10 00:36:08

In my quest for a universal groupwise-max, I've seen many answers and blog posts on the subject. Even my favorite (actually part of a fantastic series on the subject) failed to identify a portable solution, instead diving deep into specifics per RDBMS.

Luckily, a portable solution does exist!

The secondary index you need for this is on name. (An index on name, id would be identical in InnoDB, as the primary key is always included implicitly in secondary indexes.)

Create groups of the messages, and use a dependent subquery to get the latest row for each group.

SELECT m.*

-- Step 1: Start by obtaining the groupwise maximums
FROM
(
    SELECT (
        -- Step 1b: Find the ID of the group maximum by seeking in the index
        SELECT id
        FROM messages m
        WHERE m.name = grp.name
        ORDER BY m.name DESC, m.id DESC -- Match the index EXACTLY, and indicate direction
        LIMIT 1
    ) AS id

    -- Step 1a: Find the groups by seeking through the index
    -- (alias is grp because GROUPS is a reserved word as of MySQL 8.0.2)
    FROM messages AS grp
    GROUP BY grp.name
) AS maxes

-- Step 2: For each group, join the max row by ID
-- This neatly separates any potential followup SQL from the groupwise-max tactics
INNER JOIN messages m ON m.id = maxes.id
;

This is portable because it requires only the following combination of building blocks:

  • Indexed GROUP BY.
  • Indexed SELECT with ORDER BY [ASC/DESC] and LIMIT/TOP.
  • Dependent subqueries.

Just be sure to have the correct index: GroupKeyColumn(s), GroupWinnerColumn(s), PrimaryKeyColumn(s).

In OP's case, the group key is name, the group's winner is determined by id, and the primary key is already covered by that, so: name, id.

Many have suggested solutions involving subqueries, but the most overlooked aspect is the highly specific set of ordering clauses that causes the correct index to be used - in the right traversal direction, no less.

Additional Advantages

  • Easily adjustable for min (ASC) vs. max (DESC).
  • Winner per group can be composite, e.g. timestamp, id. (This also allows us to disambiguate non-unique winners, such as "latest timestamp".)
  • Group key can be composite, e.g. company_id, department_name.
  • Easily extended with a WHERE on which groups to select.
  • Easily extended with a WHERE on what items to ignore, both indexed (id >= 1000) and non-indexed (is_deleted = 0).

Why does this work [optimally]?

Imagine leafing through the physical phone book, finding the last entry for each town, i.e. the entry with that town's alphabetically greatest name. How would you do it?

You would start at the very end. The very last entry in the book is the group maximum of the last town. It is the first result row that you encounter.

For each subsequent desired result row, you would binary search backwards, to the next-greatest town. At the point where the current town transitions into its predecessor, there is the predecessor's last row (alphabetically greatest name), i.e. your next result row. Repeat until no more towns.

Loosely speaking, the phone book is like a secondary index on { Town, Name, PhoneNumber }, with PhoneNumber serving as the primary key. (I'm simplifying things for argument's sake, pretending phone numbers are assigned to one person and names form a single column.)

You are effectively doing a reverse seek through the index. By repeatedly jumping to the next town efficiently (thanks to binary search or a B-tree structure), the work is constrained by the number of result rows rather than the total number of rows. This is asymptotically optimal. And thanks to the reverse traversal direction, each town you encounter "starts" with its greatest row, your target. That is important: imagine the absurd amount of needless work if you'd have to scan all rows for a town.

Changing the solution to a groupwise-min is as trivial as changing the traversal direction, i.e. from DESC to ASC.

RDBMS Notes

  • Whereas MySQL 8 correctly shows Using index for this, MySQL 5.7 shows a worrisome Using where; Using index, but it actually performs correctly. (Tested on a huge data set involving very large groups. Tens of thousands of results spread through hundreds of millions of records were obtained in ~3 seconds.)
  • For SQL Server, the syntax is SELECT TOP 1 instead of SELECT ... LIMIT 1.
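
As a quick end-to-end check of the query shape described in this answer, here is a sketch run against the sample data from the question. It assumes Python's built-in sqlite3 module as a stand-in engine; the index name idx_messages_name_id and the aliases grp and inner_m are hypothetical, and the sketch only demonstrates the result, not the MySQL execution plan discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
     (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1")],
)
# Index layout from the answer: group key first, then the column that picks the winner.
conn.execute("CREATE INDEX idx_messages_name_id ON messages (name, id)")

rows = conn.execute(
    """
    SELECT m.*
    FROM (
        -- One row per group; the dependent subquery picks that group's largest id.
        SELECT (
            SELECT inner_m.id
            FROM messages AS inner_m
            WHERE inner_m.name = grp.name
            ORDER BY inner_m.name DESC, inner_m.id DESC
            LIMIT 1
        ) AS id
        FROM messages AS grp
        GROUP BY grp.name
    ) AS maxes
    INNER JOIN messages AS m ON m.id = maxes.id
    ORDER BY m.id
    """
).fetchall()

print(rows)
```

Switching both DESC keywords to ASC in the inner ORDER BY turns this into the groupwise-min variant mentioned above.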
删除会话 2024-08-10 00:36:08

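The query below works for the sample data in the question, where the first two characters of Other_Columns identify the group and the values within a group sort in the same order as Id:

SELECT M1.* 
FROM messages M1,
(
 SELECT SUBSTR(Other_Columns,1,2), MAX(Other_Columns) AS Max_Other_Columns
 FROM messages
 GROUP BY 1
) M2
WHERE M1.Other_Columns = M2.Max_Other_Columns
ORDER BY Other_Columns;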
无人问我粥可暖 2024-08-10 00:36:08

If you want the last row for each Name, you can number the rows within each Name group, ordered by Id in descending order, and keep the row numbered 1.

QUERY

SELECT t1.Id, 
       t1.Name, 
       t1.Other_Columns
FROM 
(
     SELECT Id, 
            Name, 
            Other_Columns,
    (
        CASE Name WHEN @curA 
        THEN @curRow := @curRow + 1 
        ELSE @curRow := 1 AND @curA := Name END 
    ) + 1 AS rn 
    FROM messages t, 
    (SELECT @curRow := 0, @curA := '') r 
    ORDER BY Name,Id DESC 
)t1
WHERE t1.rn = 1
ORDER BY t1.Id;

SQL Fiddle

孤星 2024-08-10 00:36:08

Here is my solution:

SELECT 
  DISTINCT NAME,
  MAX(MESSAGES) OVER(PARTITION BY NAME) MESSAGES 
FROM MESSAGE;
清眉祭 2024-08-10 00:36:08

If performance is really your concern you can introduce a new column on the table called IsLastInGroup of type BIT.

Set it to true on the rows that are last in their group and maintain it on every row insert/update/delete. Writes will be slower, but you'll benefit on reads. It depends on your use case, and I recommend it only if your workload is read-heavy.

So your query will look like:

SELECT * FROM Messages WHERE IsLastInGroup = 1
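
Maintaining the flag on insert could look roughly like the sketch below. It assumes Python's built-in sqlite3 module as a stand-in for MySQL and a hypothetical helper named insert_message; updates and deletes would need similar handling, which is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (Id INTEGER PRIMARY KEY, Name TEXT, Other_Columns TEXT, "
    "IsLastInGroup INTEGER NOT NULL DEFAULT 0)"
)


def insert_message(conn, name, data):
    """Insert a row and move the IsLastInGroup flag onto it (hypothetical helper)."""
    with conn:  # one transaction: committed on success, rolled back on error
        # Clear the flag on the row that was last in this group until now.
        conn.execute(
            "UPDATE Messages SET IsLastInGroup = 0 WHERE Name = ? AND IsLastInGroup = 1",
            (name,),
        )
        # The new row becomes the last one in its group.
        conn.execute(
            "INSERT INTO Messages (Name, Other_Columns, IsLastInGroup) VALUES (?, ?, 1)",
            (name, data),
        )


for name, data in [("A", "A_data_1"), ("A", "A_data_2"), ("A", "A_data_3"),
                   ("B", "B_data_1"), ("B", "B_data_2"), ("C", "C_data_1")]:
    insert_message(conn, name, data)

rows = conn.execute(
    "SELECT Id, Name, Other_Columns FROM Messages WHERE IsLastInGroup = 1 ORDER BY Id"
).fetchall()
print(rows)
```

Doing both statements in one transaction is the important design choice here: it avoids a window in which a group has either two flagged rows or none.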
跨年 2024-08-10 00:36:08

MariaDB 10.3 and newer using GROUP_CONCAT.

The idea is to use ORDER BY + LIMIT:

SELECT GROUP_CONCAT(id ORDER BY id DESC LIMIT 1) AS id,
       name,
       GROUP_CONCAT(Other_columns ORDER BY id DESC LIMIT 1) AS Other_columns
FROM t
GROUP BY name;

db<>fiddle demo

娇柔作态 2024-08-10 00:36:08

How about this:

SELECT DISTINCT ON (name) *
FROM messages
ORDER BY name, id DESC;

I had a similar issue (on PostgreSQL, though) with a table of 1M records. This solution takes 1.7 s versus 44 s for the one with LEFT JOIN. Note that DISTINCT ON is PostgreSQL syntax and is not available in MySQL.
In my case I also had to filter out NULL values in the counterpart of your name field, which improved performance even further, by 0.2 seconds.

檐上三寸雪 2024-08-10 00:36:08

What about:

select *, max(id) from messages group by name 

I have tested it on SQLite, and it returns all columns plus the max id value for every name.
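
The SQLite behaviour described here can be reproduced with the short sketch below, which uses Python's built-in sqlite3 module and the sample data from the question. It relies on SQLite's documented rule that, when a query contains a single max() aggregate, the bare columns take their values from the row holding the maximum; that rule is specific to SQLite, and MySQL makes no such guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
     (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1")],
)

# In SQLite the bare columns from * come from the same row as max(id).
# This is SQLite-specific behaviour, not portable SQL.
rows = conn.execute(
    "SELECT *, max(id) FROM messages GROUP BY name ORDER BY name"
).fetchall()

print(rows)
```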

赠意 2024-08-10 00:36:08

As of MySQL 8.0.14, this can also be achieved using Lateral Derived Tables:

SELECT t.*
FROM messages t
JOIN LATERAL (
  SELECT name, MAX(id) AS id 
  FROM messages t1
  WHERE t.name = t1.name
  GROUP BY name
) trn ON t.name = trn.name AND t.id = trn.id

db<>fiddle

少女情怀诗 2024-08-10 00:36:08

Yet another option without subqueries.

This solution uses the MySQL LAST_VALUE window function together with a window function frame, so it requires a MySQL version that supports window functions.

SELECT DISTINCT 
    LAST_VALUE(Id)            
        OVER(PARTITION BY Name 
             ORDER     BY Id 
             ROWS BETWEEN 0 PRECEDING 
                      AND UNBOUNDED FOLLOWING),
    Name,
    LAST_VALUE(Other_Columns)            
        OVER(PARTITION BY Name 
             ORDER     BY Id 
             ROWS BETWEEN 0 PRECEDING 
                      AND UNBOUNDED FOLLOWING)
FROM   
    tab

Try it here.

月竹挽风 2024-08-10 00:36:08

I hope the Oracle query below helps:

WITH Temp_table AS
(
    SELECT id, name, othercolumns,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rank
    FROM messages
)
SELECT id, name, othercolumns FROM Temp_table WHERE rank = 1
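
The same ROW_NUMBER pattern can be tried locally with the sketch below. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first version with window functions) rather than Oracle, and it uses the alias rn, a hypothetical name, for the row number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, othercolumns TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
     (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1")],
)

# Number the rows inside each name group from newest (rn = 1) to oldest,
# then keep only the newest row of every group.
rows = conn.execute(
    """
    WITH temp_table AS (
        SELECT id, name, othercolumns,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
        FROM messages
    )
    SELECT id, name, othercolumns FROM temp_table WHERE rn = 1 ORDER BY id
    """
).fetchall()

print(rows)
```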