Optimizing a MySQL aggregate query
I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.
I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:
DateRanges
range_id range_name range_start range_end
1 0-90 0 90
2 91-180 91 180
3 181-365 181 365
4 366-1095 366 1095
5 1096+ 1096 999999999
And wrote a query that looks like this:
SELECT r.range_name,
       SUM(IF(DATEDIFF(CURDATE(), t.file_last_access) > r.range_start
              AND DATEDIFF(CURDATE(), t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name
However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.
What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
Can anyone help out with this?
Thanks.
EDIT: MySQL Version: 5.0.45, Tables are MyISAM
EDIT2: Here's the describe that was asked for in the comments:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE r ALL NULL NULL NULL NULL 5 Using temporary; Using filesort
1 SIMPLE t ALL NULL NULL NULL NULL 96506321
3 Answers
First, create an index on HugeFileTable.file_last_access. Then try the following query:
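(The query this answer refers to did not survive in this copy of the page; what follows is only a sketch of the kind of join it describes, assuming the intent is to join on a DATEDIFF value falling BETWEEN the range bounds, and reusing the question's table and column names.)

-- one output row per range; the join condition buckets each file by its day difference
SELECT r.range_name, COUNT(t.file_last_access) AS FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
  ON DATEDIFF(CURDATE(), t.file_last_access) BETWEEN r.range_start AND r.range_end
GROUP BY r.range_name;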
I tried this query on MySQL 5.0.75 and looked at its EXPLAIN plan. It's still not going to perform very well: by using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that. But at least this query eliminates the Cartesian product that you had in your original query.
Update: here's another query that uses a correlated subquery, but I have eliminated the GROUP BY:
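(Again the original statement is missing from this copy; a sketch of a correlated-subquery version, under the same naming assumptions, could look like this.)

-- one subquery per range row; no GROUP BY needed
SELECT r.range_name,
       (SELECT COUNT(*)
          FROM `HugeFileTable` t
         WHERE DATEDIFF(CURDATE(), t.file_last_access)
               BETWEEN r.range_start AND r.range_end) AS FileCount
FROM `DateRanges` r;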
The EXPLAIN plan for this one shows no temporary table or filesort (at least with the trivial number of rows I have in my test tables). Try this query on your data set and see if it performs better.
Well, start by making sure that file_last_access is an index for the table HugeFileTable.
I'm not sure if this is possible/better, but try to compute the date limits first (files from date A to date B), then use a query with >= and <=. It will, theoretically at least, improve the performance.
The comparison would be something like:
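(The snippet is missing from this copy; the following is a sketch of the idea, with the 0-90 day range precomputed as concrete dates so an index on file_last_access can be used. The variable names and the 90-day cutoff are illustrative, not from the original answer.)

-- compute the boundary dates once, outside the per-row comparison
SET @range_start_date = CURDATE() - INTERVAL 90 DAY;
SET @range_end_date   = CURDATE();

SELECT COUNT(*) AS FileCount
FROM `HugeFileTable` t
WHERE t.file_last_access >= @range_start_date
  AND t.file_last_access <= @range_end_date;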
You could get a small improvement by removing CURDATE() and putting a literal date in the query, since the SQL as written runs this function twice for each row.
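(For illustration only: the date below stands in for whatever "today" is when the query is generated; it is not from the original answer.)

-- instead of DATEDIFF(CURDATE(), t.file_last_access) ...
SUM(IF(DATEDIFF('2009-03-01', t.file_last_access) > r.range_start
       AND DATEDIFF('2009-03-01', t.file_last_access) < r.range_end, 1, 0)) AS FileCount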