优化MySQL聚合查询

发布于 2024-07-18 07:29:03 字数 1324 浏览 12 评论 0原文

我在 MySQL 中有一个非常大的表（约 1 亿条记录），其中包含有关文件的信息。其中一条信息是每个文件的修改日期。

我需要编写一个查询来计算符合指定日期范围的文件数量。为此，我制作了一个小表，指定这些范围（全部以天为单位），如下所示：

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999

并编写了一个如下所示的查询：

SELECT r.range_name, sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

但是，可以预见的是，该查询需要永远运行。我认为这是因为我要求 MySQL 遍历 HugeFileTable 5 次，每次对每个文件执行 DATEDIFF() 计算。

我想做的是仅逐条记录地遍历 HugeFileTable 记录一次，并为每个文件增加相应 range_name 运行总数中的计数。我不知道该怎么做......

任何人都可以帮忙吗？

谢谢。

编辑：MySQL版本：5.0.45，表是MyISAM

编辑2：这是评论中要求的描述

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321

原文

I've got a very large table (~100Million Records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.

I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999

And wrote a query that looks like this:

SELECT r.range_name, sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.

What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....

Can anyone help out with this?

Thanks.

EDIT: MySQL Version: 5.0.45, Tables are MyISAM

EDIT2: Here's the descibe that was asked for in the comments

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

肤浅与狂妄 2024-07-25 07:29:03

首先，在 HugeFileTable.file_last_access 上创建索引。

然后尝试以下查询：

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() + INTERVAL r.range_start DAY AND 
   CURDATE() + INTERVAL r.range_end DAY)
GROUP BY r.range_name;

这是我在 MySQL 5.0.75 上尝试此查询时得到的 EXPLAIN 计划（为简洁起见进行了编辑）：

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

它仍然不会执行得很好。通过使用GROUP BY，查询会产生一个临时表，这可能会很昂贵。对此你无能为力。

但至少此查询消除了原始查询中的笛卡尔积。

更新：这是另一个使用相关子查询的查询，但我已经删除了GROUP BY。

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

EXPLAIN 计划没有显示临时表或文件排序（至少在测试表中的行数很少）：

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

在您的数据集上尝试此查询，看看它是否表现更好。

First, create an index on HugeFileTable.file_last_access.

Then try the following query:

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() + INTERVAL r.range_start DAY AND 
   CURDATE() + INTERVAL r.range_end DAY)
GROUP BY r.range_name;

Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.

But at least this query eliminates the Cartesian product that you had in your original query.

update: Here's another query that uses a correlated subquery but I have eliminated the GROUP BY.

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

Try this query on your data set and see if it performs better.

回复收藏 0 原文

旧故 2024-07-25 07:29:03

好吧，首先确保 file_last_access 是表 HugeFileTable 的索引。

我不确定这是否可能\更好，但尝试首先计算日期限制（从日期 A 到日期 B 的文件），然后使用 > 进行一些查询;= 和 <=。至少从理论上来说，它会提高性能。

比较会是这样的：

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate

Well, start by making sure that file_last_access is an index for the table HugeFileTable.

I'm not sure if this is possible\better, but try to compute the dates limits first (files from date A to date B), then use some query with >= and <=. It will, theoretically at least, improve the performance.

The comparison would be something like:

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate

回复收藏 0 原文

柳絮泡泡 2024-07-25 07:29:03

您可以通过删除 CURDATE() 并在查询中添加日期来获得一个小改进，因为它将在 SQL 中为每行运行此函数两次。

回复收藏 0 原文

~没有更多了~

关于作者

失去的东西太少

暂无简介

文章

26 人气

关注发私信

友情链接

文江博客

优化MySQL聚合查询

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

诺曦

要走干脆点

把回忆走一遍

陌上青苔

Arthur

哄哄

友情链接

优化MySQL聚合查询

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

诺曦

要走干脆点

把回忆走一遍

陌上青苔

Arthur

哄哄

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。