How can I efficiently select, in SQL, the average of multiple sums calculated at different timestamps?
I have a database table looking like the following:
id | macaddr | load | timestamp
=========================================
1 | 0011111 | 17 | 2012-02-07 10:00:00
2 | 0011111 | 6 | 2012-02-07 12:00:00
3 | 0022222 | 3 | 2012-02-07 12:00:03
4 | 0033333 | 9 | 2012-02-07 12:00:04
5 | 0022222 | 4 | 2012-02-07 12:00:06
6 | 0033333 | 8 | 2012-02-07 12:00:10
...
Now, I would like to calculate the average load over all devices (= mac addresses) for different time periods, e.g., for today, yesterday, this week, this month.
An average load can be calculated by first finding the overall load sum at different points in time (sample dates) and then averaging the load sums over these sample dates. For example, if I wanted the average load of the last ten seconds (and now is 2012-02-07 12:00:10), I could choose my sample dates to be 12:00:02, 12:00:04, 12:00:06, 12:00:08, and 12:00:10. Then, I would calculate the load sums by summing up the most recent load value of each device:
2012-02-07 12:00:02 | 6 [= load(id=2)]
2012-02-07 12:00:04 | 18 [= load(id=2) + load(id=3) + load(id=4)]
2012-02-07 12:00:06 | 19 [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:08 | 19 [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:10 | 18 [= load(id=2) + load(id=5) + load(id=6)]
A device's load value is ignored if it is older than, e.g., an hour (this happens here to id=1). The average would be 16 in this case.
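The calculation described above can be sketched outside the database as follows. This is only an illustration of the definition, not a proposed solution: the names READINGS, load_sum_at and average_load are made up for this sketch, and it assumes that a reading taken exactly at a sample date counts towards that sample, as the worked sums suggest.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# The example rows from the table above: (macaddr, load, timestamp).
READINGS = [
    ("0011111", 17, "2012-02-07 10:00:00"),
    ("0011111", 6, "2012-02-07 12:00:00"),
    ("0022222", 3, "2012-02-07 12:00:03"),
    ("0033333", 9, "2012-02-07 12:00:04"),
    ("0022222", 4, "2012-02-07 12:00:06"),
    ("0033333", 8, "2012-02-07 12:00:10"),
]


def load_sum_at(sample, readings, max_age=timedelta(hours=1)):
    """Sum the most recent load of each device as of `sample`.

    Readings older than `max_age` or newer than `sample` are ignored.
    """
    latest = {}  # macaddr -> (timestamp, load) of the newest usable reading
    for macaddr, load, ts_text in readings:
        ts = datetime.strptime(ts_text, FMT)
        if not (sample - max_age < ts <= sample):
            continue
        if macaddr not in latest or ts > latest[macaddr][0]:
            latest[macaddr] = (ts, load)
    return sum(load for _, load in latest.values())


def average_load(samples, readings):
    """Average the load sums over all sample dates."""
    sums = [load_sum_at(sample, readings) for sample in samples]
    return sum(sums) / len(sums)


now = datetime.strptime("2012-02-07 12:00:10", FMT)
# Sample dates 12:00:02, 12:00:04, ..., 12:00:10.
samples = [now - timedelta(seconds=2 * k) for k in range(4, -1, -1)]
sums = [load_sum_at(sample, READINGS) for sample in samples]

print(sums)                             # [6, 18, 19, 19, 18]
print(average_load(samples, READINGS))  # 16.0
```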
Currently, I generate a rather complex query with many "UNION ALL" statements which is reeeeally slow:
SELECT avg(l.load_sum) AS avg_load
FROM (
    SELECT sum(so.load) AS load_sum
    FROM (
        SELECT *
        FROM (
            SELECT si.macaddr, si.load
            FROM sensor_data si
            WHERE si.timestamp > '2012-02-07 11:00:10'
              AND si.timestamp < '2012-02-07 12:00:10'
            ORDER BY si.timestamp DESC
        ) AS sm
        GROUP BY macaddr
    ) so
    UNION ALL
    [THE SAME THING AGAIN WITH OTHER TIMESTAMPS]
    UNION ALL
    [AND AGAIN]
    UNION ALL
    [AND AGAIN]
    ...
) l
Now imagine I would like to calculate the average load for a whole month. With hourly sample dates I need to join 30x24=720 queries using the UNION ALL statement. The overall query takes nearly a minute to complete on my machine. I am sure there is a much better solution without the UNION ALL statement. However, I did not find anything useful on the Web. I would therefore be very thankful for your help!
Comments (3)
Use a fraction of the unix timestamp: First we formulate the hourly (3600 seconds) averages:
Then we average those over the month
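A minimal sketch of this idea, not the answerer's original code. It runs the SQL through Python's sqlite3 module only so that it can be executed as-is; SQLite's strftime('%s', ...) stands in for MySQL's UNIX_TIMESTAMP(). The function name average_hourly_load, the sample rows, and the step that sums the per-device averages within each hour are assumptions, the last one taken from the question's definition of a load sum.

```python
import sqlite3


def average_hourly_load(conn, start, end):
    """Average, over all hour buckets in [start, end), of the summed
    per-device hourly average load."""
    query = """
        SELECT AVG(hourly.load_sum)
        FROM (
            -- One row per hour: sum of the devices' hourly averages.
            SELECT per_device.hour_bucket,
                   SUM(per_device.avg_load) AS load_sum
            FROM (
                -- Unix time divided by 3600 gives one integer per hour.
                SELECT CAST(strftime('%s', timestamp) AS INTEGER) / 3600
                           AS hour_bucket,
                       macaddr,
                       AVG(load) AS avg_load
                FROM sensor_data
                WHERE timestamp >= ? AND timestamp < ?
                GROUP BY hour_bucket, macaddr
            ) AS per_device
            GROUP BY per_device.hour_bucket
        ) AS hourly
    """
    return conn.execute(query, (start, end)).fetchone()[0]


# Hypothetical sample data spanning two hours.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_data ("
    "id INTEGER PRIMARY KEY, macaddr TEXT, load INTEGER, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO sensor_data (macaddr, load, timestamp) VALUES (?, ?, ?)",
    [
        ("0011111", 10, "2012-02-07 10:05:00"),
        ("0011111", 20, "2012-02-07 10:45:00"),
        ("0022222", 4, "2012-02-07 10:30:00"),
        ("0011111", 6, "2012-02-07 11:10:00"),
        ("0022222", 8, "2012-02-07 11:20:00"),
        ("0022222", 2, "2012-02-07 11:50:00"),
    ],
)

# Hour 10: 15 + 4 = 19, hour 11: 6 + 5 = 11, so the average is 15.0.
print(average_hourly_load(conn, "2012-02-07 00:00:00", "2012-02-08 00:00:00"))
```

Note that this averages all readings within an hour, whereas the question sums the most recent reading per device at each sample date, so the two numbers can differ.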
To make things easier for yourself you should create an "hour" function that returns a datetime with nothing significant after the hour part. So right now (2/7/2012 5:05pm) would be 2012-02-07 17:00. Here's the code for your hour function:
(Replace current_timestamp in the above code with the datetime parameter of your hour function.) I'll assume you created it as dbo.fnHour() and that it takes a datetime parameter. You can then use dbo.fnHour() as a partitioning function to query what you want. Your SQL will look something like this:
I haven't tested it so there may be some typos but this should be enough to get you going.
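A rough, runnable sketch of the approach described in this answer, not the answerer's own code. It uses SQLite through Python's sqlite3 module instead of SQL Server, so fn_hour here is a hypothetical Python stand-in for dbo.fnHour(), registered with create_function. The query shape (latest reading per device per hour, summed per hour, then averaged) and the sample rows are assumptions, and the join assumes a device never has two readings with the same timestamp.

```python
import sqlite3
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def fn_hour(ts_text):
    """Truncate a 'YYYY-MM-DD HH:MM:SS' string to the start of its hour."""
    ts = datetime.strptime(ts_text, FMT)
    return ts.replace(minute=0, second=0, microsecond=0).strftime(FMT)


def average_latest_hourly_load(conn):
    """Average over all hours of the summed latest load per device."""
    query = """
        SELECT AVG(per_hour.load_sum)
        FROM (
            SELECT latest.hour_start, SUM(d.load) AS load_sum
            FROM (
                -- Newest timestamp of each device within each hour.
                SELECT macaddr,
                       fn_hour(timestamp) AS hour_start,
                       MAX(timestamp) AS ts
                FROM sensor_data
                GROUP BY macaddr, fn_hour(timestamp)
            ) AS latest
            JOIN sensor_data AS d
              ON d.macaddr = latest.macaddr AND d.timestamp = latest.ts
            GROUP BY latest.hour_start
        ) AS per_hour
    """
    return conn.execute(query).fetchone()[0]


# Hypothetical sample data spanning two hours.
conn = sqlite3.connect(":memory:")
conn.create_function("fn_hour", 1, fn_hour)
conn.execute(
    "CREATE TABLE sensor_data ("
    "id INTEGER PRIMARY KEY, macaddr TEXT, load INTEGER, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO sensor_data (macaddr, load, timestamp) VALUES (?, ?, ?)",
    [
        ("0011111", 10, "2012-02-07 10:05:00"),
        ("0011111", 20, "2012-02-07 10:45:00"),
        ("0022222", 4, "2012-02-07 10:30:00"),
        ("0011111", 6, "2012-02-07 11:10:00"),
        ("0022222", 8, "2012-02-07 11:20:00"),
        ("0022222", 2, "2012-02-07 11:50:00"),
    ],
)

print(fn_hour("2012-02-07 17:05:00"))    # 2012-02-07 17:00:00
# Hour 10: 20 + 4 = 24, hour 11: 6 + 2 = 8, so the average is 16.0.
print(average_latest_hourly_load(conn))
```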
I may be misunderstanding what you are trying to do. It looks like you are making things a lot more complicated than they need to be by using the sampling. Perhaps giving samples of what the result should look like would allow people to provide better solutions for your particular case.