Java Apache Commons getPercentile() 结果与 MS Excel 百分位数不同
我有一个算法,可以使用 Apache Commons 计算一系列值(12 个值)的百分位数(85),以便稍后通过阈值进行评估以做出决定。结果与 Excel 给出的结果类似,但不相等,有时这对我的应用程序至关重要,因为使用 Excel 时,结果未通过阈值,而使用 Java
中的 Apache Commons Math 时,结果却通过了阈值,所以我得到不同的输出。
这是一个示例:每 2 小时互联网流量 (Mbps)
32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000
除以 1000 Mb(电缆的容量)后,我计算职业的百分比(85):
Excel:0,049153870117
Apache Commons Math:0.05003126676104001
我发现可以使用 setPercentileImpl()
更改百分位数的实现(不存在官方的实现),但我找不到任何有关如何执行此操作的示例,也找不到 Excel 算法(这是我被告知要实现的目标)。
任何有关这方面的帮助都将受到欢迎。
谢谢。
I have an algorithm that calculates the percentile(85)
with Apache Commons
of a series of values (12 values), for a later evaluation with a threshold to make a decision. The result is similar to the one given by Excel, but not equal, and sometimes this is critical for my application because with excel the result doesn't pass the threshold and with Apache Commons Math in Java
it does, so I get different outputs.
Here it is an example: Internet traffic (Mbps) every 2 hours
32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000
After dividing by 1000 Mb (the capacity of the cable) I calculate the percentil(85) of the Occupation:
Excel: 0,049153870117
Apache Commons Math: 0.05003126676104001
I have found that it is possible to change the implementation of the percentile (it does not exist an official one) with setPercentileImpl()
, but I couldn't find any example of how to do this, or the Excel algorithm (which is the one I was told to achieve).
Any help about this will be welcomed.
Thank you.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
差异是微妙的并且是由于假设造成的。用三元案例最容易解释。假设你有三个元素(N=3)
a=x[0]
a=x[0]
a=x[0]
a=x[0]
b=x[1] < c=x[2]
。 Apache 和 Excel 方法都表示元素 b 是第 50 个百分位(中位数)。但是,a
和c
的情况有所不同。Apache 方法(和NIST 页面引用的方法)假设
a
是第 25 个百分点,c
是 75% 百分点,因为它将将空间划分为 N+1 个块,即划分为四分之一。Excel 方法表示
a
是第 0 个百分位数,c
是第 100 个百分位数,因为空间被分为 N-1 个块,即一半。因此,如果您想要 Excel 方法并且不想自己编写代码,您可以从数组中删除最小和最大元素,然后调用 Apache 方法 - 它应该给您完全相同的结果,除了超出终点的百分位数。
如果您想自己编写代码,下面给出了一种简单的方法。请注意这些问题:
代码(未经测试或什至编译,但应该给您一个想法)。
The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3)
a=x[0] < b=x[1] < c=x[2]
. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ fora
andc
.The Apache method (and the method referenced by the NIST page) say
a
is the 25th percentile andc
is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.The Excel method says that
a
is the 0th percentile andc
the 100th percentile, as the space is divided into N-1 blocks, that is, in half.Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.
If you want to code it yourself, an easy way is given below. Be aware of these issues:
Code (not tested or even compiled, but should give you an idea).
解决方案是创建一个类 PercentileExcel ,它几乎是 commons 方法中百分位数的副本,除了如何计算位置的一个小变化:
然后您需要将此行添加到代码中才能使用百分位数的新类:
The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:
Then you need to add this line to the code in order to use the new class for percentile:
org.apache.commons.math3.stat.descriptive.rank.Percentile
类已经支持 Excel 样式插值,您只需使用EstimationType.R_7
启用它Class
org.apache.commons.math3.stat.descriptive.rank.Percentile
already supports Excel style interpolation, you just need to enable it withEstimationType.R_7
从数据集计算得出的百分位数没有唯一的定义。有关最常用的定义,请参阅维基百科页面。
There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.
以下无需新类的替代方案适用于 3.6:
The following alternative with no new class needed works in 3.6: