优化MySQL聚合查询

发布于 2024-07-18 07:29:03 字数 1324 浏览 10 评论 0原文

我在 MySQL 中有一个非常大的表(约 1 亿条记录),其中包含有关文件的信息。 其中一条信息是每个文件的修改日期。

我需要编写一个查询来计算符合指定日期范围的文件数量。 为此,我制作了一个小表,指定这些范围(全部以天为单位),如下所示:

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999

并编写了一个如下所示的查询:

SELECT r.range_name, sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

但是,可以预见的是,该查询需要永远运行。 我认为这是因为我要求 MySQL 遍历 HugeFileTable 5 次,每次对每个文件执行 DATEDIFF() 计算。

我想做的是仅逐条记录地遍历 HugeFileTable 记录一次,并为每个文件增加相应 range_name 运行总数中的计数。 我不知道该怎么做......

任何人都可以帮忙吗?

谢谢。

编辑:MySQL版本:5.0.45,表是MyISAM

编辑2:这是评论中要求的描述

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321   

I've got a very large table (~100Million Records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.

I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999

And wrote a query that looks like this:

SELECT r.range_name, sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.

What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....

Can anyone help out with this?

Thanks.

EDIT: MySQL Version: 5.0.45, Tables are MyISAM

EDIT2: Here's the descibe that was asked for in the comments

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321   

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

肤浅与狂妄 2024-07-25 07:29:03

首先,在 HugeFileTable.file_last_access 上创建索引。

然后尝试以下查询:

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() + INTERVAL r.range_start DAY AND 
   CURDATE() + INTERVAL r.range_end DAY)
GROUP BY r.range_name;

这是我在 MySQL 5.0.75 上尝试此查询时得到的 EXPLAIN 计划(为简洁起见进行了编辑):

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

它仍然不会执行得很好。 通过使用GROUP BY,查询会产生一个临时表,这可能会很昂贵。 对此你无能为力。

但至少此查询消除了原始查询中的笛卡尔积。


更新:这是另一个使用相关子查询的查询,但我已经删除了GROUP BY

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

EXPLAIN 计划没有显示临时表或文件排序(至少在测试表中的行数很少):

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

在您的数据集上尝试此查询,看看它是否表现更好。

First, create an index on HugeFileTable.file_last_access.

Then try the following query:

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() + INTERVAL r.range_start DAY AND 
   CURDATE() + INTERVAL r.range_end DAY)
GROUP BY r.range_name;

Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.

But at least this query eliminates the Cartesian product that you had in your original query.


update: Here's another query that uses a correlated subquery but I have eliminated the GROUP BY.

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

Try this query on your data set and see if it performs better.

旧故 2024-07-25 07:29:03

好吧,首先确保 file_last_access 是表 HugeFileTable索引

我不确定这是否可能\更好,但尝试首先计算日期限制(从日期 A 到日期 B 的文件),然后使用 > 进行一些查询;= 和 <=。 至少从理论上来说,它会提高性能。

比较会是这样的:

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate 

Well, start by making sure that file_last_access is an index for the table HugeFileTable.

I'm not sure if this is possible\better, but try to compute the dates limits first (files from date A to date B), then use some query with >= and <=. It will, theoretically at least, improve the performance.

The comparison would be something like:

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate 
柳絮泡泡 2024-07-25 07:29:03

您可以通过删除 CURDATE() 并在查询中添加日期来获得一个小改进,因为它将在 SQL 中为每行运行此函数两​​次。

You could get a small improvement by removing CURDATE() and putting a date in the query as it will run this function for each row twice in your SQL.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文