Optimizing a query on a datetime field that does not use the index as expected

Posted on 2024-12-11 07:05:42

I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:

CREATE TABLE `counters` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_counters_on_kind` (`kind`),
  KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:

SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;

Right now, that query takes 3-5 seconds, and EXPLAIN estimates it as follows:

+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table    | type  | possible_keys                    | key                    | key_len | ref  | rows    | Extra       |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
|  1 | SIMPLE      | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258     | NULL | 1185531 | Using where | 
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

With the created_at index removed, it looks like this:

+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table    | type  | possible_keys | key                    | key_len | ref  | rows    | Extra       |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
|  1 | SIMPLE      | counters | index | NULL          | index_counters_on_kind | 258     | NULL | 1185531 | Using where | 
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

(Yes, for some reason the row estimate is larger than the number of rows in the table.)

So, apparently, there's no point to that index.

Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.

Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:

SELECT kind, COUNT(id) FROM counters WHERE created_at >= 
    (NOW() - INTERVAL 7 DAY) GROUP BY kind;

I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.


╄→承喏 2024-12-18 07:05:42

Why not use a concatenated (multi-column) index?

CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);

This should allow an index-only scan (EXPLAIN reports "Using index" in the Extra column), because id is NOT NULL anyway, so COUNT(id) does not need to read the table rows.
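
To illustrate the idea, here is a minimal sketch that runs the question's query against an index on (created_at, kind). It uses Python's built-in sqlite3 module purely so that it can be run anywhere; SQLite's planner is not MySQL 5.0's, the column types are simplified, and the sample rows are invented, so treat it as a demonstration of the principle rather than a prediction of what MySQL will do.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption:
# simplified column types; SQLite stores the datetimes as ISO-style text).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counters (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        kind TEXT,
        created_at TEXT
    );
    CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
""")

# Invented sample rows, only for illustration.
rows = [
    ("error", "2024-12-01 10:00:00"),
    ("error", "2024-12-15 10:00:00"),
    ("info", "2024-12-16 10:00:00"),
    ("info", "2024-12-17 10:00:00"),
    ("warn", "2024-11-20 10:00:00"),
]
conn.executemany("INSERT INTO counters (kind, created_at) VALUES (?, ?)", rows)

query = "SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind"
cutoff = "2024-12-10 00:00:00"

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (cutoff,)))
counts = dict(conn.execute(query, (cutoff,)).fetchall())

print(plan)
print(counts)
```

Every column the query touches (created_at for the filter, kind for the grouping, and the primary key) is available from the index itself, which is what makes an index-only scan possible. On MySQL, the equivalent check would be to run EXPLAIN on the query after creating the index and look for "Using index" in Extra.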

呆° 2024-12-18 07:05:42

After reading the latest edit to the question, the problem seems to be that MySQL was interpreting the parameter in the WHERE clause as a string rather than as a datetime value. That would explain why the optimizer did not select the index_counters_on_created_at index: it would instead scan, converting each created_at value to its string representation before doing the comparison. I think this can be prevented with an explicit cast to datetime in the WHERE clause:

where `created_at` >= convert({specific_date}, datetime)
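
On the application side, one way to make sure the value substituted for {specific_date} is always a well-formed datetime is to compute it from a datetime object instead of assembling a string by hand. The sketch below is only an illustration: cutoff_literal is a hypothetical helper name, and how a given MySQL driver binds the value (as a string or as a datetime) is driver-specific and not shown here.

```python
from datetime import datetime, timedelta


def cutoff_literal(now: datetime, days: int) -> str:
    """Return the instant `days` before `now` formatted as 'YYYY-MM-DD HH:MM:SS'."""
    return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")


# Client-side counterpart of NOW() - INTERVAL 7 DAY, with a fixed "now"
# so the result is reproducible.
example = cutoff_literal(datetime(2024, 12, 18, 7, 5, 42), 7)
print(example)  # 2024-12-11 07:05:42
```

The resulting string can be bound to the ? placeholder or passed through convert(..., datetime) as shown above.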

My original comments still apply to the optimization part.

The real performance killer here is the kind column. When doing the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated row count is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient, and it uses the utf8 character set with the utf8_unicode_ci collation, which increases the complexity of the comparisons needed to determine the unique values in that column.

This will perform a lot better if you change the type of the kind column to int, because integer comparisons are simpler and more efficient than Unicode character comparisons. It would also help to have a catalog table for the kinds of messages, in which you store the kind_id and description. Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:

select k.kind_id, count(*)
from
    kind_catalog k
    inner join (
        select kind_id
        from counters
        where created_at >= ?
    ) c on k.kind_id = c.kind_id
group by k.kind_id

This will first filter the counters table by created_at >= ? and can benefit from the index on that column. It will then join the result to the kind_catalog table and, if the SQL optimizer is good, scan the smaller kind_catalog table to do the grouping instead of the counters table.
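
As a rough, runnable sketch of that layout, the following uses Python's built-in sqlite3 module with the schema's created_at column name. The table definitions, the description values, and the sample rows are assumptions made for illustration, and SQLite will not reproduce MySQL's plan or timings; the point is only to show the shape of the schema and that the join returns one count per kind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical catalog table: one row per message kind.
    CREATE TABLE kind_catalog (
        kind_id INTEGER PRIMARY KEY,
        description TEXT NOT NULL
    );
    -- Log table storing an integer kind_id instead of a varchar kind.
    CREATE TABLE counters (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        kind_id INTEGER NOT NULL REFERENCES kind_catalog(kind_id),
        created_at TEXT
    );
    CREATE INDEX index_counters_on_created_at ON counters(created_at);
""")

# Invented sample data.
conn.executemany(
    "INSERT INTO kind_catalog (kind_id, description) VALUES (?, ?)",
    [(1, "error"), (2, "info"), (3, "warn")],
)
conn.executemany(
    "INSERT INTO counters (kind_id, created_at) VALUES (?, ?)",
    [
        (1, "2024-12-01 10:00:00"),
        (1, "2024-12-15 10:00:00"),
        (2, "2024-12-16 10:00:00"),
        (2, "2024-12-17 10:00:00"),
        (3, "2024-11-20 10:00:00"),
    ],
)

cutoff = "2024-12-10 00:00:00"

# Filter the log table by date first, then join to the catalog and group.
grouped = conn.execute(
    """
    SELECT k.kind_id, k.description, COUNT(*)
    FROM kind_catalog k
    INNER JOIN (
        SELECT kind_id FROM counters WHERE created_at >= ?
    ) c ON k.kind_id = c.kind_id
    GROUP BY k.kind_id
    ORDER BY k.kind_id
    """,
    (cutoff,),
).fetchall()

print(grouped)  # [(1, 'error', 1), (2, 'info', 2)]
```

Because this is an inner join, kinds with no rows after the cutoff (kind 3 in the sample data) do not appear in the result.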
