Proper indexing/optimization for MySQL GROUP BY and JOIN queries
I've done a lot of reading and Googling on this and I cannot find any satisfactory answer so I'd appreciate any help. Most answers I find come close to my situation but do not address it (and attempting to follow the solutions has not done me any good).
See Edit #2 below for the best example
[This was the original question but is not a great representation of what I'm asking.]
Say I have 2 tables, each with 4 columns:
- key (int, auto increment)
- c1 (a date)
- c2 (a varchar of length 3)
- c3 (also a varchar of length 3)
And I want to perform the following query:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.c1, t.c2
Both key fields are indexed as primary keys. I want to get the number of rows returned in each grouping of c1, c2.
When I explain this query I get "using temporary; using filesort". The actual table I'm performing this query on is over 500,000 rows, so that means it's a time consuming query.
So my question is (assuming I'm not doing anything wrong in the query): is there a way to index this table to eliminate the temporary/filesort usage?
Thanks in advance for any help.
Edit
Here is the table definition (in this example both tables are identical - in reality they're not but I'm not sure it makes a difference at this point):
CREATE TABLE `test1` (
`key` int(11) NOT NULL auto_increment,
`c1` date NOT NULL,
`c2` varchar(3) NOT NULL,
`c3` varchar(3) NOT NULL,
PRIMARY KEY (`key`),
UNIQUE KEY `c1` (`c1`,`c2`),
UNIQUE KEY `c2_2` (`c2`,`c1`),
KEY `c2` (`c2`,`c3`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
Full EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL NULL NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 tracking.t.key 1 Using index
This is just for my sample tables. In my real tables the rows value for t says 500,000+ (every row in the table, though that could be related to something else).
Edit #2
Here is a more concrete example to better explain my situation.
Let's say I have data on Little League baseball games. I have two tables. One holds data on the games:
CREATE TABLE `ex_games` (
`game_id` int(11) NOT NULL auto_increment,
`home_team` int(11) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`game_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The other holds data on the at bats in each game:
CREATE TABLE `ex_atbats` (
`ab_id` int(11) NOT NULL auto_increment,
`game` int(11) NOT NULL,
`team` int(11) NOT NULL,
`player` int(11) NOT NULL,
`result` tinyint(1) NOT NULL,
PRIMARY KEY (`ab_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
So I have two questions. Let's start with the simple version: I want to return a list of games with a count of how many at bats are in each game. So I think I would do something like this:
SELECT date, home_team, COUNT(h.ab_id) FROM `ex_atbats` h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY g.game_id
This query uses filesort/temporary. Is there a better way to structure this or to index the tables to get rid of that?
Then, the trickier part: say I now want to not only include a count of the number of at bats, but also include a count of the number of at bats that were preceded by an at bat with the same result by the same team. I assume that would be something like:
SELECT g.date, g.home_team, COUNT(ab.ab_id), COUNT(ab2.ab_id) FROM `ex_atbats` ab
LEFT JOIN ex_games g ON g.game_id = ab.game
LEFT JOIN ex_atbats ab2 ON ab2.ab_id = ab.ab_id - 1 AND ab2.result = ab.result
GROUP BY g.game_id
Is that the correct way to structure that query? This also uses filesort/temporary.
So what is the optimal way to go about accomplishing these tasks?
Thanks again.
Comments (4)
The phrases Using temporary and Using filesort usually are not related to the indexes used in the JOIN operation. There are numerous examples where you can have all indexes set (they show up in the key and key_len columns in EXPLAIN) but you still get Using temporary and Using filesort.
Check out what the manual says about Using temporary and Using filesort: ORDER BY Optimization
Having a combined index for all columns used in the GROUP BY clause may help to get rid of Using filesort in certain circumstances. If you also issue ORDER BY you may need to add more complex indexes.
If you have a huge dataset, consider partitioning it using some criteria like date or timestamp, by means of actual partitioning or a simple WHERE clause.
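As a minimal sketch of the combined-index idea, applied to the ex_games/ex_atbats example from the question: the index name idx_game and the rewritten query shape are assumptions, not something the asker posted. Driving the query from ex_games lets the GROUP BY follow the primary key order, and an index on ex_atbats.game serves the join, which may let MySQL drop Using temporary; Using filesort. Whether it does depends on the optimizer and the data, so the EXPLAIN output is the thing to check.

```sql
-- Hypothetical index name; lets each game's at bats be found by index lookup.
ALTER TABLE ex_atbats ADD INDEX idx_game (game);

-- Start from ex_games so the grouping follows the primary key.
-- Games with no at bats are listed with a count of 0.
EXPLAIN
SELECT g.game_id, g.`date`, g.home_team, COUNT(ab.ab_id) AS at_bats
FROM ex_games g
LEFT JOIN ex_atbats ab ON ab.game = g.game_id
GROUP BY g.game_id;
```

To limit the work on a large table, a condition such as WHERE g.`date` >= '2010-01-01' (an arbitrary example date) can be added before the GROUP BY.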
First of all, the tables' definitions do matter. It's one thing to join using two primary keys, another to join using a primary key from one side and a non-unique key on the other, etc. It also matters what type of engine the tables use, as InnoDB treats primary keys differently than the MyISAM engine.
What I notice though is that on table test1, the (c1,c2) combination is unique and the fields are not nullable. This allows your query to be rewritten with GROUP BY t.key in place of GROUP BY t.c1, t.c2.
It will give the same results while using the same field for the JOIN and the GROUP BY. Note that MySQL allows you to use in the SELECT list fields that are not in the GROUP BY list, without having aggregate functions on them. This is not allowed in most other systems and is seen as a bug by some. In this situation though it is a very nice feature. Every row can be identified either by (key) or by (c1,c2), so it shouldn't matter which of the two is used for the grouping.
Another thing to note is that when you use LEFT JOIN, it's common to use the joining column from the right side for the counting: COUNT(t2.key) and not COUNT(*). Your original query will give 1 in that column for records in test1 that do not match any record in test2, because it counts rows, while you probably want to count the related records in test2 - and show 0 in those cases.
So, try the query with GROUP BY t.key and COUNT(t2.key), and post the EXPLAIN.
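A sketch of a query matching this description, assuming the test1/test2 definitions from the question; the alias matches is an assumption added for readability.

```sql
-- Group by the same column that is used for the join (the primary key of test1).
-- COUNT(t2.`key`) ignores the NULLs produced by unmatched LEFT JOIN rows,
-- so test1 rows without a partner in test2 report 0 instead of 1.
EXPLAIN
SELECT t.c1, t.c2, COUNT(t2.`key`) AS matches
FROM test1 t
LEFT JOIN test2 t2 ON t2.`key` = t.`key`
GROUP BY t.`key`;
```

Because c1 and c2 are determined by the primary key, selecting them alongside GROUP BY t.`key` returns one well-defined row per test1 record.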
The indexes help with the join, but you still need to do a full sort in order to do the group by. Essentially, it still has to process every record in the set.
Adding a where clause and limiting the set would run faster, of course. It just won't get you the results you want.
There may be other options than doing a group by on the entire table. I notice you're doing a SELECT * - What are you trying to get out of the query?
SELECT DISTINCT c1, c2
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
may run faster, for instance. (I realize this was just a sample query, but understand that it's hard to optimize when you don't know what the end goal is!)
EDIT - In doing some reading (http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html), I learned that, under the correct circumstances, indexes can help significantly with the group by.
What I'm seeing is that it needs to be a sorted index (like BTREE), not a HASH. Perhaps an explicit BTREE index on the grouped columns might help.
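One way to check this, sketched under the assumption that the test1 definition from the question is in use (it already has a unique index named c1 on (c1, c2)). Ordinary MyISAM and InnoDB indexes are BTREE, while MEMORY tables default to HASH. SHOW INDEX reports the type, and an index hint can be used to see whether the optimizer is able to group in index order.

```sql
-- The Index_type column shows BTREE or HASH for each index.
SHOW INDEX FROM test1;

-- Hint the existing (c1, c2) index (named c1 in the table definition),
-- then compare the Extra column with the unhinted plan.
EXPLAIN
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t FORCE INDEX (c1)
LEFT JOIN test2 t2 ON t2.`key` = t.`key`
GROUP BY t.c1, t.c2;
```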
For InnoDB it will work, as the index carries your primary key by default. For MyISAM you have to make "key" the last column of your index. That gives the optimizer all keys in the same order, so it can skip the sort. You cannot do any range queries on the index prefix, though; that puts you right back into filesort. I'm currently struggling with a similar problem.
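A sketch of the MyISAM variant described here, assuming the test1 table from the question; the index name idx_c1_c2_key is hypothetical. The primary key column is appended explicitly as the last index column, which InnoDB secondary indexes include implicitly.

```sql
-- Hypothetical index: grouped columns first, primary key column last.
ALTER TABLE test1 ADD INDEX idx_c1_c2_key (c1, c2, `key`);

-- Re-check the plan; the Extra column shows whether the sort was skipped.
EXPLAIN
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.`key` = t.`key`
GROUP BY t.c1, t.c2;
```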