Most efficient way to SELECT one row per group from a one-to-many table pair in MySQL
Let's say I've got the following data in one-to-many tables city and person, respectively:
SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 1 | chicago | 1 | charles | 1 |
| 1 | chicago | 2 | celia | 1 |
| 1 | chicago | 3 | curtis | 1 |
| 1 | chicago | 4 | chauncey | 1 |
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 3 | los angeles | 7 | louise | 3 |
| 3 | los angeles | 8 | lucy | 3 |
| 3 | los angeles | 9 | larry | 3 |
+---------+-------------+-----------+-------------+----------------+
9 rows in set (0.00 sec)
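For anyone who wants to reproduce the output above, a schema and data set along the following lines should be enough. This is only a sketch: the column types, the NOT NULL constraints and the foreign key are assumptions, since only the query output is shown here.

```sql
-- Hypothetical DDL: types and constraints are guesses, not taken from the question.
CREATE TABLE city (
  city_id   INT PRIMARY KEY,
  city_name VARCHAR(50) NOT NULL
);

CREATE TABLE person (
  person_id      INT PRIMARY KEY,
  person_name    VARCHAR(50) NOT NULL,
  person_city_id INT NOT NULL,
  FOREIGN KEY (person_city_id) REFERENCES city (city_id)
);

-- Data copied from the result set shown above.
INSERT INTO city (city_id, city_name) VALUES
  (1, 'chicago'),
  (2, 'new york'),
  (3, 'los angeles');

INSERT INTO person (person_id, person_name, person_city_id) VALUES
  (1, 'charles',  1),
  (2, 'celia',    1),
  (3, 'curtis',   1),
  (4, 'chauncey', 1),
  (5, 'nathan',   2),
  (6, 'luke',     3),
  (7, 'louise',   3),
  (8, 'lucy',     3),
  (9, 'larry',    3);
```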
And I want to select a single record from person for each unique city using some particular logic. For example:
SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id
GROUP BY city_id ORDER BY person_name DESC
;
The implication here is that within each city, I want to get the lexicographically greatest value, e.g.:
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 1 | chicago | 3 | curtis | 1 |
+---------+-------------+-----------+-------------+----------------+
The actual output I get, however, is:
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
| 1 | chicago | 1 | charles | 1 |
+---------+-------------+-----------+-------------+----------------+
I understand that the reason for this discrepancy is that MySQL first performs the GROUP BY, then it does the ORDER BY. This is unfortunate for me, as I want the GROUP BY to have selection logic in which record it picks.
I can work around this by using some nested SELECT statements:
SELECT c.*, p.* FROM city c,
( SELECT p_inner.* FROM
( SELECT * FROM person ORDER BY person_city_id, person_name DESC ) p_inner
GROUP BY person_city_id ) p
WHERE c.city_id = p.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
| 1 | chicago | 3 | curtis | 1 |
| 2 | new york | 5 | nathan | 2 |
| 3 | los angeles | 6 | luke | 3 |
+---------+-------------+-----------+-------------+----------------+
This seems like it would be terribly inefficient when the person table grows arbitrarily large. I assume the inner SELECT statements don't know about outermost WHERE filters. Is this true?
What is the accepted best approach for doing what effectively is an ORDER BY before the GROUP BY?
Comments (2)
The usual way to do this (in MySQL) is with a join of your table to itself.
First, to get the greatest person_name per city (i.e. per person_city_id in the person table), left join person (aliased p) to itself (aliased p2). This joins person to itself within each person_city_id (your GROUP BY variable), and also pairs the tables up such that p2's person_name is greater than p's person_name.
Since it's a left join, if there's a p.person_name for which there is no greater p2.person_name (within that same city), then the p2.person_name will be NULL. These are precisely the "greatest" person_names per city, so filtering on p2.person_name IS NULL keeps exactly those rows.
So to join your other information (from city) to it, just do another join.
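A minimal sketch of the self-join described above, assuming the city and person tables from the question and the aliases p and p2 used in the explanation. The query text is a reconstruction for illustration, not necessarily what the answerer originally posted.

```sql
-- Step 1: greatest person_name per person_city_id.
-- p2 matches only rows in the same city with a greater name, so a row of p
-- with no such match (p2.* all NULL) is the greatest name in its city.
SELECT p.*
FROM person AS p
  LEFT JOIN person AS p2
    ON  p2.person_city_id = p.person_city_id
    AND p2.person_name > p.person_name
WHERE p2.person_name IS NULL;

-- Step 2: the same anti-join, with city joined in for the other columns.
SELECT c.*, p.*
FROM city AS c
  JOIN person AS p
    ON p.person_city_id = c.city_id
  LEFT JOIN person AS p2
    ON  p2.person_city_id = p.person_city_id
    AND p2.person_name > p.person_name
WHERE p2.person_name IS NULL;
```

If two people in a city share the greatest name, both rows pass the IS NULL filter, so this form returns ties rather than exactly one row per city.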
Your "solution" is not valid SQL but it works in MySQL. You can't be sure however if it will break with a future change in the query optimizer code. It could be slightly improved to have just 1 level of nesting (still not valid SQL):
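One possible shape for that single-level version, as a sketch only. It assumes the tables from the question and relies on MySQL's non-standard GROUP BY handling, so it is rejected when the ONLY_FULL_GROUP_BY SQL mode is enabled (the default since MySQL 5.7.5), and nothing guarantees that the row chosen for each group follows the ORDER BY inside the derived table.

```sql
-- Not valid standard SQL: p.* is neither aggregated nor grouped.
-- Works only with ONLY_FULL_GROUP_BY disabled, and the row picked per city
-- is an implementation detail that the optimizer is free to change.
SELECT c.*, p.*
FROM city AS c
  JOIN
    ( SELECT *
      FROM person
      ORDER BY person_city_id, person_name DESC
    ) AS p
    ON p.person_city_id = c.city_id
GROUP BY c.city_id;
```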
Another way (valid SQL syntax, works in other DBMS, too) is to make a subquery that selects the alphabetically last name for every city and then join:
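A sketch of that approach, assuming the tables from the question. The derived-table alias m and the column alias max_name are names made up for this example.

```sql
SELECT c.*, p.*
FROM city AS c
  JOIN
    ( -- one row per city: the alphabetically last name
      SELECT person_city_id, MAX(person_name) AS max_name
      FROM person
      GROUP BY person_city_id
    ) AS m
    ON m.person_city_id = c.city_id
  JOIN person AS p
    -- join back to person to pick up the remaining columns
    ON  p.person_city_id = m.person_city_id
    AND p.person_name    = m.max_name;
```

Because the join back to person matches on the name, every person who shares the maximum name within a city is returned.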
Another way is the self join (of the table person), with the < trick that @mathematical_coffee describes.
Yet another way is to use a LIMIT 1 subquery for the join of city with person. This will run a subquery (on table person) for every city, and it will be efficient if you have a (person_city_id, person_name) index for the InnoDB engine or a (person_city_id, person_name, person_id) index for the MyISAM engine.
There is one major difference between these options (numbered in the order given above):
Options 2 and 3 will return all tied results (if you have two or more persons in a city with the same name that is alphabetically last, then both or all will be shown).
Options 1 and 4 will return one result per city, even if there are ties. You can choose which one by altering the ORDER BY clause.
Which option is more efficient also depends on the distribution of your data, so the best way is to try them all, check their execution plans and find the best indexes that work for each one. An index on (person_city_id, person_name) will most likely be good for any of those queries.
By distribution I mean:
Do you have few cities with many persons per city? (I would think that options 2 and 4 would behave better in this case.)
Or many cities with few persons per city? (Option 3 may be better with such data.)
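The LIMIT 1 subquery option and the index it benefits from could look roughly like this. It is a sketch under the question's schema: the index name idx_person_city_name and the alias pi are invented for the example, and the extra person_id sort key is an assumed tie-breaker rather than part of the original answer.

```sql
-- Hypothetical index name; the column list is the one suggested above.
CREATE INDEX idx_person_city_name
  ON person (person_city_id, person_name);

SELECT c.*, p.*
FROM city AS c
  JOIN person AS p
    ON p.person_id =
       ( -- correlated subquery: runs once per city, returns a single id
         SELECT pi.person_id
         FROM person AS pi
         WHERE pi.person_city_id = c.city_id
         ORDER BY pi.person_name DESC,
                  pi.person_id DESC   -- assumed tie-breaker; change to pick a different row
         LIMIT 1
       );
```

Running each candidate query under EXPLAIN against realistic data is the practical way to compare the options, as suggested above.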