Simple way to calculate median with MySQL
What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.
Example data:
id | val
--------
1  | 4
2  | 7
3  | 2
4  | 2
5  | 9
6  | 8
7  | 3
Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val), which is 5.
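For reference, the client-side approach described in the question (fetch everything, sort, pick the middle row) can be sketched as follows. This is a minimal illustration in Python rather than PHP, using only the seven sample values; it is not meant as the in-database answer the question asks for.

```python
from statistics import mean, median

# Sample values from the question's val column.
vals = [4, 7, 2, 2, 9, 8, 3]

# Sort, then pick the middle element (valid as written for an odd count).
ordered = sorted(vals)
middle = ordered[len(ordered) // 2]

# The standard library agrees, and also handles even counts by averaging.
print("sorted:", ordered)
print("middle row:", middle, "median:", median(vals), "mean:", mean(vals))
```

The median and the mean differ here because the values are not spread symmetrically around the middle.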
Comments (30)
I found the code below on HackerRank. It is pretty simple and works in every case.
Building on velcrow's answer, for those of you who need the median of something that is grouped by another parameter:
You could use the user-defined function that's found here.
Often we need the median not just for the whole table but for aggregates per ID. In other words, we calculate the median for each ID in the table, where each ID has many records. (This performs well, works in many SQL dialects, and handles both even and odd counts. For more on the performance of different median methods, see https://sqlperformance.com/2012/08/t-sql-queries/median )
Hope it helps.
Handles an even value count: in that case it gives the average of the two middle values.
My code, efficient without tables or additional variables:
A single query to achieve the perfect median:
Optionally, you could also do this in a stored procedure:
My solution, presented below, works in just one query without creating a table, a variable, or even a subquery.
Plus, it lets you get the median for each group in GROUP BY queries (this is what I needed!):
It works because of a smart use of group_concat and substring_index.
However, to allow a large group_concat result, you have to set group_concat_max_len to a higher value (1024 bytes by default).
You can set it like this (for the current SQL session):
More info on group_concat_max_len: https://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len
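Since the query itself is not shown here, the following is only a sketch of the general idea behind combining group_concat and substring_index, emulated in Python. The helper substring_index mimics MySQL's SUBSTRING_INDEX; the function median_via_concat and the choice of the lower middle element for even counts are assumptions for illustration, not necessarily what the original query does.

```python
import math

def substring_index(s: str, delim: str, count: int) -> str:
    """Mimic MySQL SUBSTRING_INDEX: the text before the count-th delimiter
    (count > 0), or after the count-th delimiter from the right (count < 0)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def median_via_concat(values):
    # Stand-in for GROUP_CONCAT(val ORDER BY val SEPARATOR ',').
    concat = ",".join(str(v) for v in sorted(values))
    # Position of the middle item (the lower middle when the count is even).
    k = math.ceil(len(values) / 2)
    # Keep the first k items, then take the last one of those.
    return substring_index(substring_index(concat, ",", k), ",", -1)

print(median_via_concat([4, 7, 2, 2, 9, 8, 3]))
```

The result is a string, just as it would be in SQL before a cast. The sketch also shows why the length limit matters: if the concatenated string is cut off before the k-th item, the inner call can no longer reach the middle value.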
Another riff on velcrow's answer, but this one uses a single intermediate table and takes advantage of the variable used for row numbering to get the count, rather than running an extra query to calculate it. It also starts the count at 0, so the first row is row 0 and FLOOR and CEIL can be used directly to select the median row(s).
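The zero-based numbering trick can be sketched outside SQL like this. It is a small Python illustration of the arithmetic only; the function name median_zero_based is made up for the example.

```python
import math

def median_zero_based(values):
    rows = sorted(values)        # row 0 holds the smallest value
    if not rows:
        return None
    last = len(rows) - 1         # highest row number after numbering from 0
    lo = math.floor(last / 2)    # lower middle row
    hi = math.ceil(last / 2)     # upper middle row (equals lo for odd counts)
    return (rows[lo] + rows[hi]) / 2

print(median_zero_based([4, 7, 2, 2, 9, 8, 3]))
print(median_zero_based([1, 2, 3, 4]))
```

With seven rows the highest row number is 6, so both FLOOR and CEIL give row 3. With four rows it is 3, so they give rows 1 and 2, which are then averaged.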
Knowing the exact row count, you can use this query:
where <half> = ceiling(<size> / 2.0) - 1
The above seems to work for me.
Simple solution for Oracle:
Easy-to-understand solution for MySQL:
Explanation
STATION is the table name. LAT_N is the name of a column holding numeric values.
Suppose there are 101 records (an odd number) in the STATION table. The median is then the 51st record when the table is sorted in either ascending or descending order.
In the query above, for every S.LAT_N in table S, I compute two counts: the number of LAT_N values less than S.LAT_N, and the number of LAT_N values greater than S.LAT_N. I then compare the two counts, and if they match I select that S.LAT_N value. For the 51st record, 50 values are smaller and 50 are larger, so both counts are 50 and this is our answer. For every other record the two counts differ, so only the 51st record meets the condition. (This reasoning assumes the LAT_N values are distinct.)
Now suppose there are 100 records (an even number) in the STATION table. The median is then the average of the 50th and 51st records when the table is sorted in either ascending or descending order.
As in the odd case, I compute two counts: the number of LAT_N values less than S.LAT_N, and the number of LAT_N values greater than S.LAT_N. If the counts differ by exactly 1, I select that S.LAT_N value and then take the average. For the 50th record, 49 values are smaller and 50 are larger, a difference of 1, so the 50th record is the first value to average. Similarly, for the 51st record, 50 values are smaller and 49 are larger, again a difference of 1, so the 51st record is the second value to average. For every other record the counts differ by more than 1, so only the 50th and 51st records meet the condition.
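The counting logic in the explanation can be tried out with the sketch below. It runs through Python's sqlite3 module as a stand-in for MySQL, and it folds the odd and even cases into one query by accepting a difference of at most 1 between the two counts; that combination is my own assumption, not the query from the answer. As in the explanation, it relies on the LAT_N values being distinct.

```python
import sqlite3

def station_median(lat_values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (LAT_N REAL)")
    conn.executemany("INSERT INTO STATION (LAT_N) VALUES (?)",
                     [(v,) for v in lat_values])
    # For each row, count the smaller and the larger values. Keep rows where
    # the counts match (odd total) or differ by 1 (even total), then average.
    query = """
        SELECT AVG(S.LAT_N)
        FROM STATION AS S
        WHERE ABS(
            (SELECT COUNT(*) FROM STATION WHERE LAT_N < S.LAT_N) -
            (SELECT COUNT(*) FROM STATION WHERE LAT_N > S.LAT_N)
        ) <= 1
    """
    (result,) = conn.execute(query).fetchone()
    conn.close()
    return result

print(station_median(range(1, 102)))  # 101 distinct values
print(station_median(range(1, 101)))  # 100 distinct values
```

Each row triggers two more scans of the table, so this is easy to follow but likely slow on large tables.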
I am using the table below for the solution in MySQL:
Calculating the median of the 'amount' column:
TotalRecords = 16 and Median = 120.5000
This query works in both cases, i.e. for even and odd numbers of records.
If you need the median per group, use PARTITION BY in ROW_NUMBER() OVER (...).
A different way to calculate the median is to use the JSON functions in MySQL 5.7+, 8+ and MariaDB 10.2+.
This is my stored function, tested in MySQL 8.0:
You can now use this function with JSON arrays containing numeric items, or build the input with the JSON_ARRAYAGG function like this:
This method is not subject to the GROUP_CONCAT length limit.
I used a two-query approach:
These are wrapped in a function definition, so all values can be returned from one call.
If your ranges are static and your data does not change often, it might be more efficient to precompute and store these values and use the stored values instead of querying from scratch every time.
As I needed both a median and a percentile solution, I made a simple and quite flexible function based on the findings in this thread. I know I am happy when I find ready-made functions that are easy to include in my projects, so I decided to share it quickly:
Usage is very easy; here is an example from my current project:
Here is my way. Of course, you could put it into a procedure :-)
You could avoid the variable @median_counter if you substitute it:
Based on @bob's answer, this generalizes the query so that it can return multiple medians, grouped by some criteria.
Think, e.g., of the median sale price for used cars in a car lot, grouped by year-month.
After reading all the previous answers, none matched my actual requirement, so I implemented my own, which doesn't need any procedure or complicated statements: I just GROUP_CONCAT all values of the column whose median I want and, dividing COUNT by 2, extract the value from the middle of the list, as the following query does (POS is the name of the column whose median I want):
I hope this is useful to someone, in the way many other comments on this website were for me.
Most of the solutions above work for only one field of the table; you might need the median (50th percentile) for many fields in one query.
I use this:
You can replace the "50" in the example above with any percentile; it is very efficient.
Just make sure you have enough memory for the GROUP_CONCAT; you can change it with:
More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/
A comment on this page in the MySQL documentation has the following suggestion:
In MariaDB / MySQL:
Steve Cohen points out that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.
Also, AVG(dd.val) and dd.row_number IN (...) are used to correctly produce a median when there is an even number of records. Reasoning:
Finally, MariaDB 10.3.3+ contains a MEDIAN function.
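The idea of numbering the rows in one pass and then averaging the middle row(s) can be sketched in Python as below. The two positions, FLOOR((n+1)/2) and FLOOR((n+2)/2) for one-based row numbers, are one common choice and an assumption on my part, since the contents of IN (...) are not shown above.

```python
def median_by_row_number(values):
    # First pass: number the sorted rows 1..n, as a @rownum counter would.
    numbered = list(enumerate(sorted(values), start=1))
    total = len(numbered)        # what the counter holds after the pass
    if total == 0:
        return None
    # Assumed middle positions: equal for odd totals, adjacent for even ones.
    wanted = {(total + 1) // 2, (total + 2) // 2}
    picked = [val for row_number, val in numbered if row_number in wanted]
    return sum(picked) / len(picked)   # AVG over one or two rows

print(median_by_row_number([4, 7, 2, 2, 9, 8, 3]))
print(median_by_row_number([1, 2, 3, 4]))
```

For an odd total both expressions name the same row, so the average is that row's value; for an even total they name the two middle rows.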
I just found another answer online in the comments:
Make sure your columns are well indexed and that the index is used for filtering and sorting. Verify with the explain plans.
Calculate the "median" row number. Maybe use: median_row = floor(count / 2).
Then pick it out of the list:
This should return one row with just the value you want.
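The two steps above can be sketched as follows, run through Python's sqlite3 module as a stand-in for MySQL. The table name data and column name val are assumptions, and the LIMIT 1 OFFSET form is my guess at what "pick it out of the list" means.

```python
import sqlite3

def median_row_value(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (val REAL)")
    conn.executemany("INSERT INTO data (val) VALUES (?)",
                     [(v,) for v in values])
    (count,) = conn.execute("SELECT COUNT(*) FROM data").fetchone()
    result = None
    if count > 0:
        median_row = count // 2   # floor(count / 2), used as a zero-based offset
        (result,) = conn.execute(
            "SELECT val FROM data ORDER BY val LIMIT 1 OFFSET ?",
            (median_row,),
        ).fetchone()
    conn.close()
    return result

print(median_row_value([4, 7, 2, 2, 9, 8, 3]))
```

Note that for an even row count this returns the upper of the two middle values rather than their average.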
I found that the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in every situation I tested:
Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.
velcrow's answer above is close, but it does not calculate correctly for result sets with an even number of rows. The median is defined as either 1) the middle number of an odd-sized set, or 2) the average of the two middle numbers of an even-sized set.
So, here's velcrow's solution patched to handle both odd and even sets:
To use this, follow these 3 easy steps:
I propose a faster way.
Get the row count:
SELECT CEIL(COUNT(*)/2) FROM data;
Then take the middle value in a sorted subquery:
SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;
I tested this with a 5x10e6 dataset of random numbers, and it finds the median in under 10 seconds.
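Here is a runnable sketch of those two steps, using Python's sqlite3 module as a stand-in for MySQL. The middle position is passed as a bound parameter because, as far as I know, MySQL does not accept a user variable such as @middlevalue directly in LIMIT outside a prepared statement or stored routine. The function name median_by_limit is made up for the example.

```python
import sqlite3

def median_by_limit(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (val REAL)")
    conn.executemany("INSERT INTO data (val) VALUES (?)",
                     [(v,) for v in values])
    # Step 1: CEIL(COUNT(*) / 2), written with integer arithmetic because
    # SQLite's / truncates when both operands are integers.
    (middle,) = conn.execute("SELECT (COUNT(*) + 1) / 2 FROM data").fetchone()
    # Step 2: the largest value among the first `middle` sorted rows.
    (result,) = conn.execute(
        "SELECT MAX(val) FROM (SELECT val FROM data ORDER BY val LIMIT ?) AS x",
        (middle,),
    ).fetchone()
    conn.close()
    return result

print(median_by_limit([4, 7, 2, 2, 9, 8, 3]))
```

For an even row count this yields the lower of the two middle values, not their average.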
Install and use these MySQL statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/
After that, calculating the median is easy:
If MySQL has ROW_NUMBER (it does as of MySQL 8.0), then the median is (inspired by this SQL Server query):
The IN is used in case you have an even number of entries.
If you want the median per group, just add PARTITION BY group to your OVER clauses.
Rob
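A per-group version of the ROW_NUMBER idea might look like the sketch below. It runs on SQLite (3.25 or newer, for window functions) through Python's sqlite3 module as a stand-in for MySQL 8.0; the table data, the columns grp and val, and the exact IN (...) expressions are assumptions, not the query from the answer.

```python
import sqlite3

def median_per_group(rows):
    """rows: iterable of (grp, val) pairs; returns {grp: median}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (grp TEXT, val REAL)")
    conn.executemany("INSERT INTO data (grp, val) VALUES (?, ?)", list(rows))
    # Number the rows inside each group and keep the middle one or two.
    # SQLite's / truncates on integers; in MySQL use DIV or FLOOR instead.
    query = """
        SELECT grp, AVG(val)
        FROM (
            SELECT grp, val,
                   ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) AS rn,
                   COUNT(*)     OVER (PARTITION BY grp)              AS cnt
            FROM data
        ) AS numbered
        WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
        GROUP BY grp
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result

sample = [("a", v) for v in (4, 7, 2, 2, 9, 8, 3)] + \
         [("b", v) for v in (1, 2, 3, 4)]
print(median_per_group(sample))
```

Group "a" reuses the values from the question (odd count), while group "b" has an even count, so its two middle rows are averaged.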