Calculating percentile rank in MySQL

Posted 2024-07-26 05:30:18

I have a very large table of measurement data in MySQL, and I need to compute the percentile rank of every one of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. I could brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.

做个少女永远怀春 2024-08-02 05:30:19

This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.

One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.

Create an index on your number.
total = select count(*);
greater = select count(*) where value > indexed_number;

The percentage would be something like: greater / total or (total - greater) / total

Make sure that both queries use the index that you created. If they do not, tweak them until they do. The EXPLAIN output should show "Using index" in the Extra column. For the select count(*), InnoDB should be using the index, while MyISAM should show something like const, since MyISAM knows this value at any time without having to calculate it.

If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.
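
As a rough illustration of the counting idea above, here is a minimal Python sketch. It is not MySQL code: the sorted list stands in for the index on the column, and the function name `fraction_greater` and the sample data are made up for this example.

```python
from bisect import bisect_right


def fraction_greater(values):
    """For each value, return the share of all values that are strictly greater."""
    ordered = sorted(values)   # stands in for the index on the column
    total = len(ordered)       # total = SELECT COUNT(*)
    result = []
    for v in values:
        # Everything to the right of the last occurrence of v is greater than v.
        greater = total - bisect_right(ordered, v)
        result.append(greater / total)
    return result


measurements = [10, 20, 20, 40]   # hypothetical sample data
print(fraction_greater(measurements))
```

Taking `1 - fraction` gives the share of values less than or equal to each value, which is the "inverse" mentioned above.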

Does this help?

Jacob

黯淡〆 2024-08-02 05:30:19

MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:

SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col

Your question mentions "percentiles", which are a slightly different thing. For completeness, the SQL standard and some RDBMSs (Oracle, PostgreSQL, SQL Server, Teradata) have the PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions, but MySQL does not. With MySQL 8 and window functions you can, however, emulate PERCENTILE_DISC using the PERCENT_RANK and FIRST_VALUE window functions.
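
To make clear what PERCENT_RANK() returns, here is a small Python sketch of its definition, (rank - 1) / (rows - 1), where rank is the usual RANK() with ties sharing the lowest position. The function name `percent_rank` and the sample values are illustrative only; this mirrors the definition, not MySQL's implementation.

```python
def percent_rank(values):
    """Mimic PERCENT_RANK() OVER (ORDER BY value): (rank - 1) / (rows - 1)."""
    n = len(values)
    if n <= 1:
        # A partition with a single row has percent rank 0.
        return [0.0 for _ in values]
    ordered = sorted(values)
    # For each distinct value, remember the index of its first occurrence.
    # That index equals RANK() - 1, i.e. the number of rows strictly below it.
    first_position = {}
    for position, v in enumerate(ordered):
        first_position.setdefault(v, position)
    return [first_position[v] / (n - 1) for v in values]


print(percent_rank([10, 20, 20, 40]))
```

The lowest value always gets 0 and, when there are no ties at the top, the highest gets 1.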

冰之心 2024-08-02 05:30:19

If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down the excess block times of flights into an airport by percentile. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting):

// Assumes $con is an open mysqli connection (the mysql_* API used originally was removed in PHP 7).
$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort = 'JFK';

// Use the same filter for the count and for the percentile lookups so the offsets line up.
$where = "WHERE depdt >= '$startDt' AND depdt <= '$endDt' AND ArrPort = '$arrPort'";

$strSQL = "SELECT COUNT(*) AS TotFlights FROM FIDS $where";
if (!($queryResult = mysqli_query($con, $strSQL))) {
    echo $strSQL . " FAILED\n";
    echo mysqli_error($con);
    exit(1);
}
$fltRow = mysqli_fetch_assoc($queryResult);
$totFlights = (int) $fltRow['TotFlights'];
echo "Total Flights into " . $arrPort . " = " . $totFlights . "\n";

/* E.g. with 1906 flights, percentile 90 sits at offset 1906 - int(0.9 * 1906) in descending order. */
for ($x = 1; $x <= 10; $x++) {
    $pctlPosn = $totFlights - intval(($x / 10) * $totFlights);
    echo "PCTL POSN for " . ($x * 10) . " IS " . $pctlPosn . "\t";
    // LIMIT offset,1 picks the single row at that position of the sorted result.
    $pctlSQL = "SELECT (ablk - sblk) AS ExcessBlk FROM FIDS $where ORDER BY ExcessBlk DESC LIMIT $pctlPosn, 1";
    if (!($query2Result = mysqli_query($con, $pctlSQL))) {
        echo $pctlSQL . " FAILED\n";
        echo mysqli_error($con);
        exit(1);
    }
    while ($pctlRow = mysqli_fetch_assoc($query2Result)) {
        echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
    }
}

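
The offset arithmetic in the loop above can be checked without a database. The following Python sketch reproduces the same positions on an in-memory list; the function name `decile_values` and the sample data are hypothetical, and integer arithmetic is used instead of the float multiplication to avoid rounding surprises.

```python
def decile_values(excess_block_times):
    """Return {percentile: value} for the 10th..100th percentiles via sorted offsets."""
    ordered = sorted(excess_block_times, reverse=True)   # ORDER BY ExcessBlk DESC
    total = len(ordered)                                 # SELECT COUNT(*)
    deciles = {}
    for x in range(1, 11):
        # Same position as the PHP code: total - int((x / 10) * total).
        offset = total - (x * total) // 10
        # LIMIT offset,1 returns no row when the offset is past the end.
        deciles[x * 10] = ordered[offset] if offset < total else None
    return deciles


sample = list(range(1, 21))   # hypothetical excess block times 1..20
print(decile_values(sample))
```

Note that for small samples the lowest deciles can fall past the last row, in which case the query returns nothing (here represented as `None`).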
惟欲睡 2024-08-02 05:30:19

To get the rank, I'd say you need to (left) outer join the table to itself, something like:

select t1.name, t1.value, count(distinct t2.value) as lower_values
from mytable t1
left join mytable t2
on t1.value > t2.value
group by t1.name, t1.value

For each row, you will count how many (if any) rows of the same table have a lower value.

Note that I'm more familiar with SQL Server, so the syntax might not be right. Also, the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then, to get the real percentile rank, you will first need to store the number of values in a variable (or the number of distinct values, depending on the convention you want to follow) and compute the percentile rank from the rank given above.
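
The choice between counting all values and counting distinct values changes the result, which a short Python sketch can show. This reproduces the outcome of the self-join rather than the join itself; the function name `percentile_rank_from_counts` and its `distinct` flag are invented for this illustration.

```python
from bisect import bisect_left


def percentile_rank_from_counts(values, distinct=True):
    """Percentage of (distinct or all) values that are strictly lower than each value."""
    # The pool is what the row is compared against: distinct values or every row.
    pool = sorted(set(values)) if distinct else sorted(values)
    total = len(pool)
    # bisect_left gives the number of pool entries strictly lower than v.
    return [100 * bisect_left(pool, v) / total for v in values]


data = [10, 20, 20, 40]   # hypothetical sample data
print(percentile_rank_from_counts(data, distinct=True))
print(percentile_rank_from_counts(data, distinct=False))
```

With ties in the data, the distinct convention spreads the ranks over fewer steps, so the same value can land on a different percentage under each convention.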

清醇 2024-08-02 05:30:19

Suppose we have a sales table like this:

user_id, units

Then the following query gives the percentile of each user:

select a.user_id, a.units,
(sum(case when a.units >= b.units then 1 else 0 end) * 100) / count(1) percentile
from sales a join sales b
group by a.user_id, a.units;

Note that this is a cross join, so it has O(n²) complexity and can be considered an unoptimized solution, but it is simple given that this MySQL version has no built-in function for it.

时间你老了 2024-08-02 05:30:18

Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)

In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):

SELECT
    id,
    @prev := @curr as prev,
    @curr := measure as curr,
    -- aliased as rnk because RANK is a reserved word in MySQL 8.0+
    @rank := IF(@prev > @curr, @rank+@ties, @rank) AS rnk,
    @ties := IF(@prev = @curr, @ties+1, 1) AS ties,
    (1-@rank/@total) as percentrank
FROM
    mytable,
    (SELECT
        @curr := null,
        @prev := null,
        @rank := 0,
        @ties := 1,
        @total := count(*) from mytable where measure is not null
    ) b
WHERE
    measure is not null
ORDER BY
    measure DESC

Credit for this method goes to Shlomi Noach. He writes about it in detail here:

http://code.openark.org/blog/mysql/sql-ranking-without-self-join

I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
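
For readers who want to follow what the user variables do row by row, here is a Python sketch of the same single pass over the sorted data. It only mirrors the logic of the query above; the function name `percent_rank_single_pass` and the sample data are made up for the example.

```python
def percent_rank_single_pass(measures):
    """Walk the values in descending order, tracking rank and ties like the user variables."""
    # WHERE measure IS NOT NULL ... ORDER BY measure DESC
    rows = sorted((m for m in measures if m is not None), reverse=True)
    total = len(rows)   # @total
    prev = None         # @prev
    rank = 0            # @rank: number of rows with a strictly greater value
    ties = 1            # @ties: size of the current run of equal values
    out = []
    for curr in rows:
        if prev is not None and prev > curr:
            rank += ties            # skip past the whole run of tied rows
        ties = ties + 1 if prev == curr else 1
        out.append((curr, rank, 1 - rank / total))
        prev = curr
    return out


for row in percent_rank_single_pass([20, None, 40, 10, 20]):
    print(row)
```

Each tuple holds the value, its rank, and the resulting percent rank, so the top value gets 1.0 and tied values share the same figure.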

慈悲佛祖 2024-08-02 05:30:18
SELECT
    c.id, c.score, ROUND(((@rank - c.rnk) / @rank) * 100, 2) AS percentile_rank
FROM
    (SELECT
        a.*,
        @prev := @curr,
        @curr := a.score,
        -- aliased as rnk because RANK is a reserved word in MySQL 8.0+
        @rank := IF(@prev = @curr, @rank, @rank + 1) AS rnk
    FROM
        (SELECT id, score FROM mytable) AS a,
        (SELECT @curr := null, @prev := null, @rank := 0) AS b
    ORDER BY score DESC) AS c;