Function to calculate median in SQL Server
According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the Create Aggregate function, user defined function, or some other method).
What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?
I just came across this page while looking for a set-based solution for the median. After looking at some of the solutions here, I came up with the following. Hope it helps/works.
The following query returns the median from a list of values in one column. It cannot be used as or along with an aggregate function, but you can still use it as a sub-query with a WHERE clause in the inner select.
SQL Server 2005+:
Although Justin Grant's solution appears solid, I found that when a given partition key has a number of duplicate values, the ASC row numbers for the duplicates end up out of sequence, so the ascending and descending numbering do not align properly.
Here is a fragment from my result:
I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.
In a UDF, write:
Justin's example above is very good, but the need for a primary key should be stated very clearly. I have seen that code in the wild without the key, and the results are bad.
The complaint I get about PERCENTILE_CONT is that it won't always give you an actual value from the dataset (for an even number of rows it interpolates between the two middle values).
To get a "median" that is an actual value from the dataset, use PERCENTILE_DISC.
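To make the difference concrete, here is a minimal sketch (SQL Server 2012 or later) that runs both functions over the same four values. The table variable `@Readings` and the column `val` are made-up names for illustration only.

```sql
-- Hypothetical sample data: four values, so the middle falls between 3 and 4.
DECLARE @Readings TABLE (val INT NOT NULL);
INSERT INTO @Readings (val) VALUES (1), (3), (4), (10);

SELECT DISTINCT
       -- Interpolates between the two middle rows: 3.5, which is not a stored value.
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER () AS median_cont,
       -- Returns an existing value: 3, the first value whose cumulative
       -- distribution reaches 0.5.
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY val) OVER () AS median_disc
FROM @Readings;
```

Both are window functions in T-SQL, so the result is repeated on every row; the DISTINCT collapses it to a single row.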
Median Finding
This is the simplest method to find the median of an attribute.
See other solutions for median calculation in SQL here:
"Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).
Building on Jeff Atwood's answer above, here it is with GROUP BY and a correlated subquery to get the median for each group.
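As a rough sketch of that idea, assuming the referenced answer is the TOP 50 PERCENT approach (largest value of the bottom half plus smallest value of the top half, divided by two), a per-group version could look like this. The table `@Scores` and its columns are hypothetical.

```sql
-- Hypothetical sample data: two groups, one with an odd and one with an even row count.
DECLARE @Scores TABLE (grp CHAR(1) NOT NULL, score DECIMAL(10, 2) NOT NULL);
INSERT INTO @Scores (grp, score)
VALUES ('A', 1), ('A', 2), ('A', 9),
       ('B', 2), ('B', 4), ('B', 6), ('B', 8);

SELECT g.grp,
       (
           -- Largest value in the lower half of this group.
           (SELECT MAX(score)
            FROM (SELECT TOP 50 PERCENT s.score
                  FROM @Scores AS s
                  WHERE s.grp = g.grp
                  ORDER BY s.score) AS bottom_half)
           +
           -- Smallest value in the upper half of this group.
           (SELECT MIN(score)
            FROM (SELECT TOP 50 PERCENT s.score
                  FROM @Scores AS s
                  WHERE s.grp = g.grp
                  ORDER BY s.score DESC) AS top_half)
       ) / 2 AS median
FROM @Scores AS g
GROUP BY g.grp;
-- Group A has median 2, group B has median 5.
```

TOP 50 PERCENT rounds the row count up, so with an odd number of rows both halves include the middle row and the average is that row's value.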
For a continuous variable/measure 'col1' from 'table1'
Frequently, we need to calculate the median not just for the whole table, but for aggregates with respect to some ID. In other words, calculate the median for each ID in our table, where each ID has many records. (Based on the solution edited by @gdoron: good performance, and it works in many SQL dialects.)
Hope it helps.
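A minimal sketch of a per-ID median, using ROW_NUMBER and a windowed COUNT partitioned by the ID. This is an illustration of the general idea, not necessarily the query the answer had in mind; `@Measurements`, `id` and `val` are made-up names.

```sql
-- Hypothetical sample data: id 1 has three rows, id 2 has four.
DECLARE @Measurements TABLE (id INT NOT NULL, val DECIMAL(10, 2) NOT NULL);
INSERT INTO @Measurements (id, val)
VALUES (1, 10), (1, 20), (1, 30),
       (2, 5), (2, 7), (2, 9), (2, 11);

SELECT id, AVG(val) AS median
FROM (
    SELECT id,
           val,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY val) AS rn,
           COUNT(*)     OVER (PARTITION BY id)              AS cnt
    FROM @Measurements
) AS ranked
-- Integer division: for an odd cnt both expressions name the single middle row,
-- for an even cnt they name the two middle rows.
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY id;
-- id 1 has median 20, id 2 has median 8.
```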
For large scale datasets, you can try this GIST:
https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2
It works by aggregating the distinct values you would find in your set (such as ages, or year of birth, etc.), and uses SQL window functions to locate any percentile position you specify in the query.
To get the median salary from an employee table:
I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P
This works with SQL 2000:
For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where median values are coming from...
In absolute awe of some of the code above, though!
This is as simple an answer as I could come up with. Worked well with my data. If you want to exclude certain values just add a where clause to the inner select.
The following solution works under these assumptions:
Code:
I tried several alternatives, but because my data has repeated values, the ROW_NUMBER versions did not seem to be an option for me. So here is the query I used (a version with NTILE):
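For readers who want to see the shape of an NTILE-based median, here is a hedged sketch (not the query from this answer). NTILE(2) splits the ordered rows into a lower and an upper half by position, so duplicate values cause no trouble. `@Samples` and `val` are hypothetical names.

```sql
-- Hypothetical sample data with repeated values.
DECLARE @Samples TABLE (val DECIMAL(10, 2) NOT NULL);
INSERT INTO @Samples (val) VALUES (1), (2), (2), (2), (5), (9);

SELECT CASE
           -- Odd row count: NTILE(2) puts the extra row in half 1,
           -- so the median is the largest value of half 1.
           WHEN COUNT(*) % 2 = 1
               THEN MAX(CASE WHEN half = 1 THEN val END)
           -- Even row count: average the top of half 1 and the bottom of half 2.
           ELSE (MAX(CASE WHEN half = 1 THEN val END)
                 + MIN(CASE WHEN half = 2 THEN val END)) / 2.0
       END AS median
FROM (
    SELECT val, NTILE(2) OVER (ORDER BY val) AS half
    FROM @Samples
) AS halves;
-- The six values sort to 1, 2, 2 | 2, 5, 9, so the median is 2.
```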
For your question, Jeff Atwood has already given a simple and effective solution. But if you are looking for an alternative approach to calculating the median, the SQL code below will help you.
If you are looking to calculate the median in MySQL, this GitHub link will be useful.
If you're using SQL 2005 or better this is a nice, simple-ish median calculation for a single column in a table:
2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the perf of various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries, which may not be practical in all cases. It also requires SQL 2012 or later.
Of course, one test on one schema in 2012 yielding great results is no guarantee: your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.
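A minimal sketch of a two-query pattern of this kind, assuming the pattern meant is the OFFSET/FETCH one (SQL Server 2012 or later): the first query counts the rows, the second skips to the middle and reads one or two rows. The temp table `#Samples` and column `val` are illustrative names, not the article's schema.

```sql
-- Hypothetical sample data.
CREATE TABLE #Samples (val INT NOT NULL);
INSERT INTO #Samples (val) VALUES (1), (3), (4), (10);

-- Query 1: count the rows.
DECLARE @c BIGINT = (SELECT COUNT(*) FROM #Samples);

-- Query 2: skip to the middle, then read one row (odd count)
-- or two rows (even count) and average them.
SELECT AVG(1.0 * val) AS median
FROM (
    SELECT val
    FROM #Samples
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS middle_rows;
-- With four rows: skip 1, fetch 2 (values 3 and 4), median 3.5.

DROP TABLE #Samples;
```

On a real table, an index on the ordered column is what lets the engine reach the middle rows cheaply, so the numbers you see will depend heavily on your indexing.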
I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
ORIGINAL 2009 POST IS BELOW:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from "Medians, ROW_NUMBERs, and performance". This is a particularly optimal solution when it comes to actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.
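A sketch of the ascending/descending ROW_NUMBER technique with a unique tie-breaker, written against made-up data (`@Orders`, `order_id`, `customer_id`, `total_due` are hypothetical names). The median rows are the ones where the two row numbers meet or differ by one.

```sql
-- Hypothetical sample data: customer 100 has duplicate totals,
-- which is exactly where the tie-breaker matters.
DECLARE @Orders TABLE (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    total_due   DECIMAL(10, 2) NOT NULL
);
INSERT INTO @Orders (order_id, customer_id, total_due)
VALUES (1, 100, 50), (2, 100, 50), (3, 100, 50), (4, 100, 80),
       (5, 200, 10), (6, 200, 30), (7, 200, 70);

SELECT customer_id, AVG(total_due) AS median_due
FROM (
    SELECT customer_id,
           total_due,
           -- order_id makes both orderings deterministic and exact mirrors of each other.
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY total_due ASC, order_id ASC)   AS row_asc,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY total_due DESC, order_id DESC) AS row_desc
    FROM @Orders
) AS numbered
WHERE row_asc IN (row_desc, row_desc - 1, row_desc + 1)
GROUP BY customer_id
ORDER BY customer_id;
-- Customer 100 has median 50, customer 200 has median 30.
```

Without the unique column in both ORDER BY clauses, rows with equal values can be numbered in an arbitrary order in each direction, and the filter may then pick the wrong rows or none at all.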
As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.
In SQL Server 2012 you should use PERCENTILE_CONT:
See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/
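A minimal sketch of PERCENTILE_CONT used for a per-group median. In T-SQL it is a window function rather than a true aggregate, so it needs an OVER clause and cannot be combined with GROUP BY directly; DISTINCT collapses the repeated rows instead. `@Sales`, `region` and `amount` are hypothetical names.

```sql
-- Hypothetical sample data.
DECLARE @Sales TABLE (region VARCHAR(10) NOT NULL, amount DECIMAL(10, 2) NOT NULL);
INSERT INTO @Sales (region, amount)
VALUES ('East', 10), ('East', 20), ('East', 40),
       ('West', 5), ('West', 15);

SELECT DISTINCT
       region,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)
           OVER (PARTITION BY region) AS median_amount
FROM @Sales;
-- East has median 20, West has median 10 (interpolated between 5 and 15).
```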
My original quick answer was:
This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.
When you stick that into an explain plan, 60% of the work is sorting the data which is unavoidable when calculating position dependent statistics like this.
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:
This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.
Even better:
From the master himself, Itzik Ben-Gan!
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function which computes a specific percentile for sorted values. PERCENTILE_DISC (0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
If you want to use the CREATE AGGREGATE feature in SQL Server, this is how to do it. Doing it this way has the benefit of letting you write clean queries. Note that this process could be adapted to calculate a percentile value fairly easily.
Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different in SQL 2012). Then create a class file and put in the following code, or the C# equivalent:
Then compile it and copy the DLL and PDB file to your SQL Server machine and run the following command in SQL Server:
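The registration step typically looks something like the sketch below. Every name here is an assumption for illustration: the assembly name `MedianAggregate`, the file path, the class name `Median`, and the FLOAT parameter and return types all need to match your own build.

```sql
-- CLR integration must be enabled on the instance before CLR code can run.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Hypothetical assembly name and path: point this at the DLL you copied over.
CREATE ASSEMBLY MedianAggregate
FROM 'C:\ClrAssemblies\MedianAggregate.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Hypothetical class name: EXTERNAL NAME is assembly_name.class_name, and the
-- parameter/return types must match the Accumulate/Terminate methods of the class.
CREATE AGGREGATE dbo.Median (@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME MedianAggregate.[Median];
GO
```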
You can then write a query to calculate the median like this:
SELECT dbo.Median(Field) FROM Table
Simple, fast, accurate
Using a single statement: one way is to use the ROW_NUMBER() and COUNT() window functions and filter the subquery. Here is how to find the median salary:
I have seen similar solutions over the net using FLOOR and CEILING but tried to use a single statement. (edited)
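A sketch of a single-statement version along these lines, avoiding FLOOR and CEILING by comparing the row number against a fractional range. The table `@Employee` and its columns are hypothetical.

```sql
-- Hypothetical sample data: four employees, so the median is the mean of rows 2 and 3.
DECLARE @Employee TABLE (emp_id INT PRIMARY KEY, salary DECIMAL(10, 2) NOT NULL);
INSERT INTO @Employee (emp_id, salary)
VALUES (1, 3000), (2, 4000), (3, 5000), (4, 9000);

SELECT AVG(salary) AS median_salary
FROM (
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary) AS rn,
           COUNT(*)     OVER ()                AS cnt
    FROM @Employee
) AS ranked
-- cnt = 4 gives the range 2.0 to 3.0 (rows 2 and 3);
-- cnt = 5 would give 2.5 to 3.5 (row 3 only).
WHERE rn BETWEEN cnt / 2.0 AND cnt / 2.0 + 1;
-- Median salary here is 4500.
```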