Calculating percentile rank in MySQL
I have a very large table of measurement data in MySQL, and I need to compute the percentile rank for each of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. Sure, I could just brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.
There is no easy way to do this. See http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html
This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.
One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.
Create an index on your number.
total = select count(*);
greater = select count(*) where value > indexed_number;
The percentage would be something like: greater / total or (total - greater) / total
Make sure that both of them are using the index that you created. If they are not, tweak them until they are. The EXPLAIN output should show "Using index" in the right-hand (Extra) column. In the case of the select count(*), it should be using the index for InnoDB and something like const for MyISAM. MyISAM will know this value at any time without having to calculate it.
If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.
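The two counts described above can be sketched as follows. This is only an illustration under stated assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL so the snippet runs on its own, and the table name `measurements`, the column `value`, and the helper `fraction_at_or_below` are hypothetical names, not part of the original answer.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.execute("CREATE INDEX idx_value ON measurements (value)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(v,) for v in (10, 20, 20, 30, 40)],
)


def fraction_at_or_below(conn, x):
    """Share of rows whose value is <= x, computed from two COUNT(*) queries."""
    total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
    greater = conn.execute(
        "SELECT COUNT(*) FROM measurements WHERE value > ?", (x,)
    ).fetchone()[0]
    return (total - greater) / total


print(fraction_at_or_below(conn, 20))
```

With five rows and two of them above 20, the call returns 3 / 5. Running this once per stored value is what the "inner select" suggestion amounts to, so the index on the value column matters.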
Does this help?
Jacob
MySQL 8 finally introduced window functions, and among them the PERCENT_RANK() function you were looking for, so you can use it directly in a query.
Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RDBMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, you can emulate PERCENTILE_DISC, however, again using the PERCENT_RANK and FIRST_VALUE window functions.
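As a rough illustration of PERCENT_RANK() as a window function, here is a small sketch. It assumes SQLite 3.25 or newer (through Python's sqlite3 module) as a stand-in for MySQL 8, since both accept the same `PERCENT_RANK() OVER (ORDER BY ...)` form; the table `measurements` and its columns are hypothetical.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for MySQL 8; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(v,) for v in (10, 20, 20, 30, 40)],
)

# PERCENT_RANK() is (rank - 1) / (rows - 1); tied values share the same rank.
rows = conn.execute(
    """
    SELECT id, value, PERCENT_RANK() OVER (ORDER BY value) AS pct_rank
    FROM measurements
    ORDER BY value, id
    """
).fetchall()

for row in rows:
    print(row)
```

The two rows with value 20 get the same percent rank because they share rank 2 out of 5 rows.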
If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down excess flight block times into an airport into their percentiles. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting).
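The original listing for this answer is not available, so what follows is not the answerer's code. It is a minimal sketch of the same idea (ORDER BY plus LIMIT offset, count, driven from a procedural language), written in Python with SQLite standing in for MySQL. The table `block_times`, the column `excess_minutes`, the helper `percentile`, and the nearest-rank convention used to pick the offset are all assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE block_times (excess_minutes INTEGER)")
conn.executemany(
    "INSERT INTO block_times (excess_minutes) VALUES (?)",
    [(v,) for v in range(10, 101, 10)],  # 10, 20, ..., 100
)


def percentile(conn, p):
    """Value at the p-th percentile (integer p, nearest-rank), for a non-empty table."""
    n = conn.execute("SELECT COUNT(*) FROM block_times").fetchone()[0]
    # Nearest-rank: the ceil(p * n / 100)-th smallest row, as a 0-based offset.
    offset = max((p * n + 99) // 100 - 1, 0)
    row = conn.execute(
        "SELECT excess_minutes FROM block_times "
        "ORDER BY excess_minutes LIMIT ?, 1",
        (offset,),
    ).fetchone()
    return row[0]


for p in (25, 50, 90):
    print(p, percentile(conn, p))
```

Each percentile costs one sorted lookup, so this suits a handful of cut points better than ranking every row.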
To get the rank, I'd say you need to (left) outer join the table on itself, something like:
For each row, you will count how many (if any) rows of the same table have a lower value.
Note that I'm more familiar with SQL Server, so the syntax might not be right. Also, the DISTINCT may not have the right behaviour for what you want to achieve. But that's the general idea.
Then, to get the real percentile rank, you will first need to get the number of values into a variable (or the number of distinct values, depending on the convention you want to take) and compute the percentile rank using the real rank given above.
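The query from this answer is not available, so the following is only a guess at its general shape: a left outer self-join that counts, for each row, the rows with a lower value, followed by a division by the row count. SQLite stands in for MySQL, the names `measurements`, `id`, and `value` are hypothetical, and the count here is over rows rather than distinct values.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(v,) for v in (10, 20, 20, 30, 40)],
)

# Step 1: for each row, count the rows of the same table with a lower value.
ranked = conn.execute(
    """
    SELECT a.id, a.value, COUNT(b.id) AS rows_below
    FROM measurements AS a
    LEFT JOIN measurements AS b ON b.value < a.value
    GROUP BY a.id, a.value
    ORDER BY a.value, a.id
    """
).fetchall()

# Step 2: fetch the total once and turn the rank into a percentile rank.
total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
percent_ranks = [(row_id, value, below / (total - 1)) for row_id, value, below in ranked]

for row in percent_ranks:
    print(row)
```

Dividing by `total - 1` follows the PERCENT_RANK convention and assumes at least two rows; dividing by `total` is the other common choice.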
Suppose we have a sales table like:
user_id, units
Then the following query will give the percentile of each user:
Note that this uses a cross join, which results in O(n²) complexity, so it can be considered an unoptimized solution, but it seems simple given that this MySQL version has no built-in function for it.
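The query itself is not available, so this is a hedged reconstruction of what a cross-join percentile over a `sales(user_id, units)` table might look like, not the answerer's exact statement. SQLite stands in for MySQL, and the "share of users with units at or below yours" definition is an assumption.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the percentile definition is a guess.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER PRIMARY KEY, units INTEGER)")
conn.executemany(
    "INSERT INTO sales (user_id, units) VALUES (?, ?)",
    [(1, 5), (2, 10), (3, 10), (4, 20)],
)

# Every row of a is paired with every row of b, hence the O(n^2) cost.
percentiles = conn.execute(
    """
    SELECT a.user_id,
           a.units,
           100.0 * SUM(CASE WHEN b.units <= a.units THEN 1 ELSE 0 END) / COUNT(*)
               AS percentile
    FROM sales AS a
    CROSS JOIN sales AS b
    GROUP BY a.user_id, a.units
    ORDER BY a.user_id
    """
).fetchall()

for row in percentiles:
    print(row)
```

For 200,000 rows the pairing produces 40 billion comparisons, which is why the answer flags it as unoptimized.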
Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)
In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):
Credit for this method goes to Shlomi Noach. He writes about it in detail here:
http://code.openark.org/blog/mysql/sql-ranking-without-self-join
I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
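The MySQL query for this answer is not available, and as far as I understand the linked technique relies on MySQL session variables that carry state from one ordered row to the next. That has no direct equivalent outside MySQL, so the sketch below only shows the underlying idea in plain Python: sort once, then make a single pass while remembering the previous value. The function name `percent_ranks` and the PERCENT_RANK-style formula are assumptions.

```python
def percent_ranks(measures):
    """Return (value, percent_rank) pairs using one sort and one pass, no join."""
    ordered = sorted(measures)
    n = len(ordered)
    result = []
    rank = 0      # number of rows strictly below the current value
    prev = None   # value seen on the previous row
    for position, value in enumerate(ordered):
        # Tied values keep the rank of the first row in their group.
        if position == 0 or value != prev:
            rank = position
            prev = value
        result.append((value, rank / (n - 1) if n > 1 else 0.0))
    return result


print(percent_ranks([30, 10, 20, 40, 20]))
```

The cost is dominated by the sort, O(n log n), which is consistent with the answer's observation that this beats the self-join by a wide margin.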