Optimizing a datetime field that is not using its index as expected
I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3 and 5 seconds, and is estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
2 Answers
Why not use a concatenated index?
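A sketch of what that might look like (the index name is illustrative; putting the range column first lets the WHERE clause use the index while the scan stays index-only):

-- covers both the created_at range filter and the kind grouping
CREATE INDEX index_counters_on_created_at_and_kind
    ON counters (created_at, kind);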
That should go for an index-only scan (mentioning "Using index" in the Extra column), because COUNT(id) is NOT NULL anyway.
After reading the latest edit on the question, the problem seems to be that the parameter being used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer: instead, it results in a scan that converts the created_at values to a string representation and then does the comparison. I think this can be prevented by an explicit cast to datetime in the WHERE clause:
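A minimal sketch of that cast, with a literal date standing in for the bound parameter:

SELECT kind, COUNT(id)
FROM counters
-- compare against an explicit DATETIME value instead of a string
WHERE created_at >= CAST('2011-09-01 00:00:00' AS DATETIME)
GROUP BY kind;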
My original comments still apply for the optimization part.
The real performance killer here is the kind column. When doing the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated row count is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient, and it uses the utf8 character set with the utf8_unicode_ci collation, which adds to the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int, because integer comparisons are more efficient and simpler than unicode character comparisons. It would also help to have a catalog table for the kinds of messages, in which you store the kind_id and description. Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
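A sketch of that query, assuming counters has been changed to store an integer kind_id and a kind_catalog(kind_id, description) table holds the message kinds:

SELECT k.description, COUNT(c.id)
FROM kind_catalog k
JOIN (
    -- filter by date first so the created_at index can be used
    SELECT kind_id, id
    FROM counters
    WHERE created_at >= ?
) c ON c.kind_id = k.kind_id
GROUP BY k.description;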
This will first filter the counters table by created_at >= ?, which can benefit from the index on that column. It will then join the result to the kind_catalog table and, if the SQL optimizer is good, it will scan the smaller kind_catalog table for the grouping instead of the counters table.