T-SQL:更好的滑动分布函数/查询
我需要一种类似于 NTILE() 提供的 T-SQL 排名方法,不同之处在于每个图块的成员将采用滑动分布,以便排名较高的图块具有较少的成员。
例如
CREATE TABLE #Rank_Table(
id int identity(1,1) not null,
hits bigint not null default 0,
PERCENTILE smallint null
)
--Slant the distribution of the data
INSERT INTO #Rank_Table (hits)
select CASE
when DATA > 9500 THEN DATA*30
WHEN data > 8000 THEN DATA*5
WHEN data < 7000 THEN DATA/3 +1
ELSE DATA
END
FROM
(select top 10000 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1 ) DATA
from master..spt_values t1
cross JOIN master..spt_values t2) exponential
Declare @hitsPerGroup as bigint
Declare @numGroups as smallint
set @numGroups=100
select @hitsPerGroup=SUM(hits)/(@numGroups -1) FROM #Rank_Table
select @hitsPerGroup HITS_PER_GROUP
--This is an even distribution
SELECT id,HITS, NTILE(@numGroups) Over (Order By HITS DESC) PERCENTILE
FROM #Rank_Table
GROUP by id, HITS
--This is my best attempt, but it skips groups because of the erratic distribution
select
T1.ID,
T1.hits,
T.RunningTotal/@hitsPerGroup + 1 TILE,
T.RunningTotal
FROM #Rank_Table T1
CROSS APPLY ( Select SUM(hits) RunningTotal FROM #Rank_Table where hits <= T1.hits) T
order by T1.hits
DROP TABLE #Rank_Table
,在#Rank_table 中,NTILE(@numGroups) 创建均匀分布的@numGroups 组。我需要的是 @numGroups 组,其中图块 1 的成员最少,图块 2 的成员数量比图块 1 多一个或多个,图块 3 的成员数量比图块 2 多一个或多个...图块 100 的成员最多。
我正在使用 SQL Server 2008。实际上,这将针对可能包含数百万行的永久表运行,以便定期更新 PERCENTILE 列及其从 1 到 100 的百分位。
我上面的最佳尝试将跳过百分位数并且表现不佳。一定有更好的方法。
I need a T-SQL ranking approach similar to the one provided by NTILE(), except that the members of each tile would be on a sliding distribution so that higher ranking tiles have fewer members.
For example
CREATE TABLE #Rank_Table(
id int identity(1,1) not null,
hits bigint not null default 0,
PERCENTILE smallint null
)
--Slant the distribution of the data
INSERT INTO #Rank_Table (hits)
select CASE
when DATA > 9500 THEN DATA*30
WHEN data > 8000 THEN DATA*5
WHEN data < 7000 THEN DATA/3 +1
ELSE DATA
END
FROM
(select top 10000 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1 ) DATA
from master..spt_values t1
cross JOIN master..spt_values t2) exponential
Declare @hitsPerGroup as bigint
Declare @numGroups as smallint
set @numGroups=100
select @hitsPerGroup=SUM(hits)/(@numGroups -1) FROM #Rank_Table
select @hitsPerGroup HITS_PER_GROUP
--This is an even distribution
SELECT id,HITS, NTILE(@numGroups) Over (Order By HITS DESC) PERCENTILE
FROM #Rank_Table
GROUP by id, HITS
--This is my best attempt, but it skips groups because of the erratic distribution
select
T1.ID,
T1.hits,
T.RunningTotal/@hitsPerGroup + 1 TILE,
T.RunningTotal
FROM #Rank_Table T1
CROSS APPLY ( Select SUM(hits) RunningTotal FROM #Rank_Table where hits <= T1.hits) T
order by T1.hits
DROP TABLE #Rank_Table
In #Rank_table, NTILE(@numGroups) creates an even distribution of @numGroups groups. What I need are @numGroups groups where the tile 1 has the fewest members, tile 2 would have one or more than tile 1, tile 3 would have 1 or more than tile 2 ... tile 100 would have the most.
I'm using SQL Server 2008. In practice this will be run against a permanent table with potentially millions of rows in order to periodically update the PERCENTILE column with its percentile from 1-100.
My best attempt above will skip percentiles and performs poorly. There must be a better way.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
更好的 NTILE 实现?青年MMV
A better NTILE implementation? YMMV
为了创建更线性的分布,我向数据表添加了一个计算列 HITS_SQRT
HITS_SQRT AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED
。使用此列,您可以计算“每百分位点击数”的目标数量。
然后,该脚本使用按命中数排序的 ROW_NUMBER() 创建一个临时表,并按降序迭代行,将其百分位数从 100 更新为 1。 保留命中数的运行总计,并且当
传递@hitsPerGroup
,百分位数从 100 降低到 99、99 到 98 等。然后用其百分位数更新源数据表。临时工作表有一个索引以加快更新速度。
使用
#Rank_Table
作为源数据表的完整脚本。性能不是很出色,但它实现了平滑滑动分布的目标。
In order to create a more linear distribution, I added a computed column to the data table, HITS_SQRT
HITS_SQRT AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED
.Using this column you can calculate a target number of "hits per percentile".
The script then creates a temp table with a ROW_NUMBER() ordered by the number of hits and iterates the rows in descending order updating its percentile from 100 to 1. A running total is kept of the number of hits, and when the
@hitsPerGroup
is passed, the percentile is lowered form 100 to 99, 99 to 98, etc.Then the source data table is updated with its percentile. There is an index of the temp work table to speed the update.
Full script using
#Rank_Table
as the source data table.The performance is not stellar, but it achieves the goal of a smooth sliding distribution.