How do I calculate percentiles with python/numpy?
Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?
I am looking for something similar to Excel's percentile function.
NumPy has np.percentile(). SciPy has scipy.stats.scoreatpercentile(), in addition to many other statistical goodies.
By the way, there is a pure-Python implementation of the percentile function, in case one doesn't want to depend on scipy. The function is copied below:
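A minimal sketch of such a pure-Python function, assuming linear interpolation between the two closest ranks (the same idea as NumPy's default). The name percentile, its 0-100 scale and the sample lists are illustrative choices, not necessarily those of the recipe referred to above:

```python
import math


def percentile(values, percent):
    """Return the percent-th percentile (0-100) of values, interpolating linearly."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    # Fractional (0-based) position of the requested percentile in the sorted data.
    k = (len(ordered) - 1) * percent / 100
    lower = math.floor(k)
    upper = math.ceil(k)
    if lower == upper:
        # The position falls exactly on an element, so no interpolation is needed.
        return ordered[lower]
    # Weight the two neighbouring elements by how close k is to each of them.
    return ordered[lower] * (upper - k) + ordered[upper] * (k - lower)


print(percentile([1, 2, 3, 4, 5], 50))  # 3
print(percentile([1, 2, 3, 4], 25))     # 1.75
```

Sorting inside the function keeps the call site simple at the cost of an O(n log n) sort per call; for repeated queries on the same data it may be worth sorting once outside.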
Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module. quantiles returns, for a given distribution dist, a list of n - 1 cut points separating the n quantile intervals (division of dist into n continuous intervals with equal probability), where n, in our case (percentiles), is 100.
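A short sketch of statistics.quantiles used for percentiles. The data (the integers 1 to 100 and a four-element list) are made up for illustration; the method argument is part of the standard-library signature and defaults to "exclusive":

```python
from statistics import quantiles

data = list(range(1, 101))  # made-up sample: 1, 2, ..., 100

# n=100 asks for percentiles: 99 cut points, cuts[i] is the (i + 1)-th percentile.
cuts = quantiles(data, n=100)
print(len(cuts))  # 99
print(cuts[49])   # 50th percentile (median): 50.5

# method="inclusive" treats the data as the whole population, so the minimum
# and maximum are the 0th and 100th percentiles.
small = [1, 2, 3, 4]
print(quantiles(small, n=4, method="inclusive"))  # [1.75, 2.5, 3.25]
print(quantiles(small, n=4))                      # [1.25, 2.5, 3.75]
```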
Here's how to do it without numpy, using only python to calculate the percentile.
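One way this could look, as a sketch that assumes the nearest-rank definition (the result is always an element of the input). The function name and the sample scores are hypothetical:

```python
import math


def percentile_nearest_rank(data, perc):
    """Return the element of data at the nearest rank for perc, with 0 < perc <= 100."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < perc <= 100:
        raise ValueError("perc must be in the range (0, 100]")
    ordered = sorted(data)
    # 1-based rank of the smallest element that covers perc percent of the data.
    rank = math.ceil(len(ordered) * perc / 100)
    return ordered[rank - 1]


scores = [15, 20, 35, 40, 50]
print(percentile_nearest_rank(scores, 40))   # 20
print(percentile_nearest_rank(scores, 50))   # 35
print(percentile_nearest_rank(scores, 100))  # 50
```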
The definition of percentile I usually see expects the result to be the value from the supplied list below which P percent of the values are found, which means the result must come from the set, not from an interpolation between set elements. To get that, you can use a simpler function.
If you would rather get the value from the supplied list at or below which P percent of the values are found, then use this simple modification:
Or with the simplification suggested by @ijustlovemath:
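The difference between the two definitions can be sketched by counting directly. This is an illustration under stated assumptions, not the functions from the answer: the names percentile_below and percentile_at_or_below are hypothetical, P is taken on a 0-100 scale, and the input is assumed to be non-empty:

```python
from bisect import bisect_left, bisect_right


def percentile_below(data, p):
    """Smallest element with at least p percent of the values strictly below it."""
    ordered = sorted(data)
    n = len(ordered)
    for x in ordered:
        # bisect_left counts the values strictly smaller than x.
        if bisect_left(ordered, x) * 100 >= p * n:
            return x
    return ordered[-1]  # no element qualifies (e.g. p = 100): fall back to the maximum


def percentile_at_or_below(data, p):
    """Smallest element with at least p percent of the values at or below it."""
    ordered = sorted(data)
    n = len(ordered)
    for x in ordered:
        # bisect_right counts the values smaller than or equal to x.
        if bisect_right(ordered, x) * 100 >= p * n:
            return x
    return ordered[-1]  # only reached when p is greater than 100


scores = [15, 20, 35, 40, 50]
print(percentile_below(scores, 40))        # 35 (two of five values lie below 35)
print(percentile_at_or_below(scores, 40))  # 20 (two of five values are <= 20)
```

Both functions always return a member of the input; they differ only in whether the returned element itself counts toward the P percent.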
Check the scipy.stats module:
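As a sketch of what that module offers, assuming SciPy is installed and using made-up data: scoreatpercentile maps a percentile to a value, and percentileofscore goes the other way.

```python
from scipy import stats

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up sample

# Value at the 90th percentile (interpolates between neighbours by default).
p90 = stats.scoreatpercentile(data, 90)
print(p90)  # approximately 9.1

# Inverse question: at which percentile does the score 7 sit in this sample?
rank_of_7 = stats.percentileofscore(data, 7)
print(rank_of_7)  # 70.0
```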
A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is by using numpy.percentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html>. Example:
However, if there is any NaN value in your data, the above function will not give a useful result. The recommended function to use in that case is numpy.nanpercentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html>:
In the two options presented above, you can still choose the interpolation mode. Follow the examples below for easier understanding.
If your input array consists only of integer values, you might want the percentile answer as an integer as well. If so, choose an interpolation mode such as 'lower', 'higher', or 'nearest'.
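A small sketch of the two functions side by side, using made-up arrays. The axis argument shown for the matrix case is part of both functions' signatures:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.percentile(a, 50))            # 3.0
print(np.percentile(a, [25, 50, 75]))  # several percentiles in one call

# A single NaN poisons np.percentile; np.nanpercentile ignores NaN entries.
b = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
with_nan = np.percentile(b, 50)
without_nan = np.nanpercentile(b, 50)
print(with_nan)     # nan
print(without_nan)  # 3.0

# For a matrix, axis selects the direction along which percentiles are taken.
m = np.array([[1, 2, 3],
              [4, 5, 6]])
per_column = np.percentile(m, 50, axis=0)
per_row = np.percentile(m, 50, axis=1)
print(per_column)  # [2.5 3.5 4.5]
print(per_row)     # [2. 5.]
```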
To calculate the percentile of a series, run:
For example:
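A possible illustration, assuming "series" here means a pandas Series and that the call in question is Series.quantile, which takes a fraction between 0 and 1 rather than a percentage. The sample values are made up:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # made-up data

# quantile expects a fraction: 0.9 is the 90th percentile.
p90 = s.quantile(0.9)
print(p90)  # approximately 4.6

# A list of fractions returns a Series indexed by those fractions.
quartiles = s.quantile([0.25, 0.5, 0.75])
print(quartiles)
```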
In case you need the answer to be a member of the input numpy array:
Just to add that the percentile function in numpy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector; in that case, from v1.9.0 onwards you can use the "interpolation" option (renamed "method" in NumPy 1.22) with either "lower", "higher" or "nearest".
The latter is an actual entry in the vector, while the former is a linear interpolation of the two vector entries that border the percentile.
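The modes can be compared in a short sketch. The helper percentile_with_mode is hypothetical: it only exists so the example runs both on NumPy versions that accept method= (1.22 and later) and on older ones that accept interpolation=. The data are made up:

```python
import numpy as np


def percentile_with_mode(a, q, mode):
    """Call np.percentile with whichever keyword this NumPy version accepts."""
    try:
        return np.percentile(a, q, method=mode)         # NumPy 1.22 and later
    except TypeError:
        return np.percentile(a, q, interpolation=mode)  # older NumPy releases


data = np.array([1, 2, 3, 4])

# The 40th percentile sits at fractional position 1.2, between the entries 2 and 3.
linear = percentile_with_mode(data, 40, "linear")
lower = percentile_with_mode(data, 40, "lower")
higher = percentile_with_mode(data, 40, "higher")
nearest = percentile_with_mode(data, 40, "nearest")

print(linear)   # approximately 2.2, not an element of data
print(lower)    # 2
print(higher)   # 3
print(nearest)  # 2
```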
For a series: use the describe function.
Suppose you have df with the columns sales and id, and you want to calculate percentiles for sales. Then it works like this:
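A sketch of how that might look with pandas. The column names sales and id come from the description above; the values and the chosen percentiles are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "sales": [100, 200, 300, 400, 500],  # made-up figures
})

# describe() accepts the percentiles to report as fractions between 0 and 1.
summary = df["sales"].describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Individual entries are labelled with percent strings.
print(summary["90%"])  # approximately 460.0
```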
I bootstrapped the data and then plotted the confidence interval for 10 samples. The confidence interval shows the range bounded by the 5th and 95th percentiles.
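A rough sketch of a percentile bootstrap interval, without the plotting. Everything specific here is an assumption for illustration: the synthetic normal data, the choice of the mean as the statistic, and the 1000 resamples:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=200)  # synthetic sample

# Resample with replacement and record the statistic (here: the mean) each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])

# The 5th and 95th percentiles of the bootstrap estimates bound a 90% interval.
lower, upper = np.percentile(boot_means, [5, 95])
print(lower, upper)
```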