Analyze time-based data and get the distribution of the data in SQL Server
I would like to analyze the historic data to get the distribution of the data.
The data in the tData table is recorded at 15-minute intervals, and the table has over 10 billion records, which makes performance really critical. I would like to analyze all 96 points (1 day = 96 * 15 minutes) across all years of data.
For example, for "00:15" I will get the minimum and maximum across all the rows whose time of day is "00:15", e.g.:
2010-01-01 00:15 2010-01-02 00:15 2010-01-03 00:15 2010-01-04 00:15 2010-01-05 00:15 2010-01-06 00:15 2010-01-07 00:15 ... ...
Then I will use the following query to get the distribution of the data in 10 sections from the tData table (DeviceID + FooTime is the PK). tData is a physical table; a table variable is used here only as a test sample:
declare @tData table(DeviceID nvarchar(10), FooTime datetime, Value float)
--just dummy data, actual data will have full 96 points for each day
INSERT INTO @tData VALUES ('PM00100', '2010-01-01 00:15', 3)
INSERT INTO @tData VALUES ('PM00100', '2010-01-02 00:15', 4)
INSERT INTO @tData VALUES ('PM00100', '2010-01-03 00:15', 5)
INSERT INTO @tData VALUES ('PM00100', '2010-01-04 00:15', 2)
INSERT INTO @tData VALUES ('PM00100', '2010-01-05 00:15', 3)
INSERT INTO @tData VALUES ('PM00100', '2010-01-06 00:15', 4)
INSERT INTO @tData VALUES ('PM00100', '2010-01-07 00:15', 5)
INSERT INTO @tData VALUES ('PM00100', '2010-01-08 00:15', 6)
INSERT INTO @tData VALUES ('PM00100', '2010-01-09 00:15', 7)
INSERT INTO @tData VALUES ('PM00100', '2010-01-10 00:15', 11)
INSERT INTO @tData VALUES ('PM00100', '2010-01-11 00:15', 12)
INSERT INTO @tData VALUES ('PM00100', '2010-01-12 00:15', 13)
INSERT INTO @tData VALUES ('PM00100', '2010-01-13 00:15', 14)
INSERT INTO @tData VALUES ('PM00100', '2010-01-14 00:15', 15)
INSERT INTO @tData VALUES ('PM00100', '2010-01-15 00:15', 16)
INSERT INTO @tData VALUES ('PM00100', '2010-01-16 00:15', 17)
INSERT INTO @tData VALUES ('PM00100', '2010-01-17 00:15', 18)
INSERT INTO @tData VALUES ('PM00100', '2010-01-18 00:15', 10)
INSERT INTO @tData VALUES ('PM00100', '2010-01-19 00:15', 19)
INSERT INTO @tData VALUES ('PM00100', '2010-01-20 00:15', 9)
INSERT INTO @tData VALUES ('PM00100', '2010-01-21 00:15', 8)
INSERT INTO @tData VALUES ('PM00100', '2010-01-22 00:15', 3)
INSERT INTO @tData VALUES ('PM00100', '2010-01-23 00:15', 4)
INSERT INTO @tData VALUES ('PM00100', '2010-01-24 00:15', 4)
INSERT INTO @tData VALUES ('PM00100', '2010-01-25 00:15', 2)
INSERT INTO @tData VALUES ('PM00100', '2010-01-26 00:15', 2)
INSERT INTO @tData VALUES ('PM00100', '2010-01-27 00:15', 5)
INSERT INTO @tData VALUES ('PM00100', '2010-01-28 00:15', 6)
INSERT INTO @tData VALUES ('PM00100', '2010-01-29 00:15', 2)
INSERT INTO @tData VALUES ('PM00100', '2010-01-30 00:15', 14)
INSERT INTO @tData VALUES ('PM00100', '2010-01-31 00:15', 15)
INSERT INTO @tData VALUES ('PM00100', '2010-02-01 00:15', 25)
DECLARE @CurrentPoint nvarchar(10)
DECLARE @MinValue float
DECLARE @MaxValue float
DECLARE @AvgValue float
DECLARE @StartDate datetime
DECLARE @EndDate datetime
DECLARE @RangeWidth float
DECLARE @RangeCount float
SET @StartDate = '2010-01-01'
SET @EndDate = '2010-02-02'
SET @CurrentPoint = '00:15'
SET @RangeCount = 9
SELECT @MinValue = MIN(Value), @MaxValue = MAX(Value), @AvgValue = AVG(Value)
FROM @tData
WHERE DeviceID = 'PM00100' AND FooTime between @StartDate and @EndDate AND CONVERT(nvarchar(5), FooTime, 108) = @CurrentPoint
SET @RangeWidth = (@MaxValue - @MinValue) / @RangeCount
SELECT FLOOR((Value - @MinValue)/@RangeWidth) * @RangeWidth + @MinValue as LowerBound
,FLOOR((Value - @MinValue)/@RangeWidth) * @RangeWidth + @MinValue + @RangeWidth as UpperBound
,COUNT(*) AS Count
,FLOOR((Value - @MinValue)/@RangeWidth) AS Position
FROM @tData
WHERE DeviceID = 'PM00100' AND FooTime between @StartDate and @EndDate AND CONVERT(nvarchar(5), FooTime, 108) = @CurrentPoint
GROUP BY FLOOR((Value - @MinValue) / @RangeWidth)
The query above takes the difference between the maximum and minimum values and divides it by 9 (which yields 10 sections); for each range of data I would like to get how many records fall into that range. Position simply outputs the bucket index.
The query works fine for a single 15-minute point of data, but when I want to process all 96 points (daily, from 00:00 to 23:45) I am currently using a loop, and it is very slow. I believe I should not need 96 separate queries and could do it in one shot, but I could not figure out how.
The result will look like:
LowerBound        UpperBound        Count  Position
2.00000000000000  4.55555555555556  11     0
4.55555555555556  7.11111111111111  6      1
7.11111111111111  9.66666666666667  2      2
9.66666666666667  12.2222222222222  3      3
12.2222222222222  14.7777777777778  3      4
14.7777777777778  17.3333333333333  4      5
17.3333333333333  19.8888888888889  2      6
25.0000000000000  27.5555555555556  1      9
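One way to avoid the 96-iteration loop, sketched against the sample table above: use window aggregates (SQL Server 2005+) so the per-point min/max is computed in the same pass as the binning. This is an untested sketch, not a verified solution for the 10-billion-row table; it assumes the same @StartDate, @EndDate, and @RangeCount variables declared earlier, and NULLIF guards the division when a point's min equals its max.

```sql
-- Sketch: bin all 96 points in one pass.
-- MIN/MAX OVER (PARTITION BY time-of-day) replaces the 96 scalar queries.
;WITH PointStats AS (
    SELECT DeviceID, FooTime, Value,
           CONVERT(nvarchar(5), FooTime, 108) AS Point,
           MIN(Value) OVER (PARTITION BY DeviceID, CONVERT(nvarchar(5), FooTime, 108)) AS MinValue,
           MAX(Value) OVER (PARTITION BY DeviceID, CONVERT(nvarchar(5), FooTime, 108)) AS MaxValue
    FROM @tData
    WHERE DeviceID = 'PM00100' AND FooTime BETWEEN @StartDate AND @EndDate
), Binned AS (
    SELECT Point, MinValue,
           (MaxValue - MinValue) / @RangeCount AS RangeWidth,
           -- NULLIF avoids divide-by-zero when all values of a point are equal
           FLOOR((Value - MinValue) / NULLIF((MaxValue - MinValue) / @RangeCount, 0)) AS Position
    FROM PointStats
)
SELECT Point,
       MinValue + Position * RangeWidth       AS LowerBound,
       MinValue + (Position + 1) * RangeWidth AS UpperBound,
       COUNT(*) AS [Count],
       Position
FROM Binned
GROUP BY Point, MinValue, RangeWidth, Position
ORDER BY Point, Position
```

For each 15-minute Point this produces the same bucket rows as the original per-point query, so one scan replaces the loop.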
2 Answers
I'm not quite sure I follow the other calculations: why you need to take the AVG/MAX etc. first, what RangeCount and RangeWidth mean, and so on. But maybe this will give you a start:
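(The answer's query was not preserved in this copy. Based on the description below, "a row for each time slice", it was presumably a per-slice aggregate along these lines; this is a reconstruction, not the original code.)

```sql
-- Sketch: one row per 15-minute time slice, with the basic aggregates.
SELECT CONVERT(nvarchar(5), FooTime, 108) AS TimeSlice,
       MIN(Value) AS MinValue,
       MAX(Value) AS MaxValue,
       AVG(Value) AS AvgValue,
       COUNT(*)   AS SampleCount
FROM @tData
WHERE DeviceID = 'PM00100'
  AND FooTime BETWEEN @StartDate AND @EndDate
GROUP BY CONVERT(nvarchar(5), FooTime, 108)
ORDER BY TimeSlice
```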
This will give you a row for each time slice. You can nest more CTEs for further calculations on the aggregates. I couldn't make heads or tails of the calculations you were doing, and your sample data and desired results don't seem to quite match up.
I found a way to improve the performance by doing the query in one shot, rather than iterating over all the time points. It first computes the min/max for each point and stores them in a temporary table.
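The actual solution code was not included in this copy of the page. The approach described (per-point min/max into a temporary table, then a single binning query joined back against it) might look roughly like the following sketch; table and column names follow the question, @RangeCount is assumed declared as before, and NULLIF is added to guard against a zero-width range.

```sql
-- Step 1: min/max for every 15-minute slot, computed once into a temp table.
SELECT CONVERT(nvarchar(5), FooTime, 108) AS Point,
       MIN(Value) AS MinValue,
       MAX(Value) AS MaxValue
INTO #PointStats
FROM tData
WHERE DeviceID = 'PM00100' AND FooTime BETWEEN @StartDate AND @EndDate
GROUP BY CONVERT(nvarchar(5), FooTime, 108)

-- Step 2: join back and bin all points in a single scan.
SELECT s.Point,
       s.MinValue + FLOOR((d.Value - s.MinValue) / w.RangeWidth) * w.RangeWidth       AS LowerBound,
       s.MinValue + (FLOOR((d.Value - s.MinValue) / w.RangeWidth) + 1) * w.RangeWidth AS UpperBound,
       COUNT(*) AS [Count],
       FLOOR((d.Value - s.MinValue) / w.RangeWidth) AS Position
FROM tData d
JOIN #PointStats s
  ON s.Point = CONVERT(nvarchar(5), d.FooTime, 108)
CROSS APPLY (SELECT NULLIF((s.MaxValue - s.MinValue) / @RangeCount, 0) AS RangeWidth) w
WHERE d.DeviceID = 'PM00100' AND d.FooTime BETWEEN @StartDate AND @EndDate
GROUP BY s.Point, s.MinValue, w.RangeWidth, FLOOR((d.Value - s.MinValue) / w.RangeWidth)
ORDER BY s.Point, Position
```

Splitting the work this way means the big table is scanned twice in total instead of 192 times (two queries per point in the loop version).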