Java Apache Commons getPercentile() 结果与 MS Excel 百分位数不同

发布于 2024-11-05 19:29:27 字数 789 浏览 1 评论 0原文

我有一个算法,可以使用 Apache Commons 计算一系列值(12 个值)的百分位数(85),以便稍后通过阈值进行评估以做出决定。结果与 Excel 给出的结果类似,但不相等,有时这对我的应用程序至关重要,因为使用 Excel 时,结果未通过阈值,而使用 Java 中的 Apache Commons Math 时,结果却通过了阈值,所以我得到不同的输出。

这是一个示例:每 2 小时互联网流量 (Mbps)

32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000

除以 1000 Mb(电缆的容量)后,我计算职业的百分比(85):

Excel:0,049153870117

Apache Commons Math:0.05003126676104001

我发现可以使用 setPercentileImpl() 更改百分位数的实现(不存在官方的实现),但我找不到任何有关如何执行此操作的示例,也找不到 Excel 算法(这是我被告知要实现的目标)。

任何有关这方面的帮助都将受到欢迎。

谢谢。

I have an algorithm that calculates the percentile(85) with Apache Commons of a series of values (12 values), for a later evaluation with a threshold to make a decision. The result is similar to the one given by Excel, but not equal, and sometimes this is critical for my application because with excel the result doesn't pass the threshold and with Apache Commons Math in Java it does, so I get different outputs.

Here it is an example: Internet traffic (Mbps) every 2 hours

32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000

After dividing by 1000 Mb (the capacity of the cable) I calculate the percentil(85) of the Occupation:

Excel: 0,049153870117

Apache Commons Math: 0.05003126676104001

I have found that it is possible to change the implementation of the percentile (it does not exist an official one) with setPercentileImpl(), but I couldn't find any example of how to do this, or the Excel algorithm (which is the one I was told to achieve).

Any help about this will be welcomed.

Thank you.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

吖咩 2024-11-12 19:29:27

差异是微妙的并且是由于假设造成的。用三元案例最容易解释。假设你有三个元素(N=3) a=x[0] a=x[0] a=x[0] a=x[0] b=x[1] < c=x[2]。 Apache 和 Excel 方法都表示元素 b 是第 50 个百分位(中位数)。但是,ac 的情况有所不同。

Apache 方法(和NIST 页面引用的方法)假设 a 是第 25 个百分点,c 是 75% 百分点,因为它将将空间划分为 N+1 个块,即划分为四分之一。

Excel 方法表示 a 是第 0 个百分位数,c 是第 100 个百分位数,因为空间被分为 N-1 个块,即一半。

因此,如果您想要 Excel 方法并且不想自己编写代码,您可以从数组中删除最小和最大元素,然后调用 Apache 方法 - 它应该给您完全相同的结果,除了超出终点的百分位数。

如果您想自己编写代码,下面给出了一种简单的方法。请注意这些问题:

  • 这对数组进行排序(因此会更改它),
  • 由于排序,这需要 O(N log(N)) 时间。 Apache 方法使用快速选择算法,因此需要 O(N) 时间(如果您想了解更多,请谷歌“quickselect”)

代码(未经测试或什至编译,但应该给您一个想法)。

// warning - modifies data 
double excelPercentile(double [] data, double percentile) { array
    Arrays.sort(data);
    double index = percentile*(data.length-1);
    int lower = (int)Math.floor(index);
    if(lower<0) { // should never happen, but be defensive
       return data[0];
    }
    if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
       return data[data.length-1);
    }
    double fraction = index-lower;
    // linear interpolation
    double result=data[lower] + fraction*(data[lower+1]-data[lower]);
    return result;
 }

The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3) a=x[0] < b=x[1] < c=x[2]. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ for a and c.

The Apache method (and the method referenced by the NIST page) say a is the 25th percentile and c is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.

The Excel method says that a is the 0th percentile and c the 100th percentile, as the space is divided into N-1 blocks, that is, in half.

Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.

If you want to code it yourself, an easy way is given below. Be aware of these issues:

  • this sorts the array (so changes it)
  • this takes O(N log(N)) time due to the sorting. The Apache method uses a fast selection algorithm so takes O(N) time (google "quickselect" if you want to know more)

Code (not tested or even compiled, but should give you an idea).

// warning - modifies data 
double excelPercentile(double [] data, double percentile) { array
    Arrays.sort(data);
    double index = percentile*(data.length-1);
    int lower = (int)Math.floor(index);
    if(lower<0) { // should never happen, but be defensive
       return data[0];
    }
    if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
       return data[data.length-1);
    }
    double fraction = index-lower;
    // linear interpolation
    double result=data[lower] + fraction*(data[lower+1]-data[lower]);
    return result;
 }
宫墨修音 2024-11-12 19:29:27

解决方案是创建一个类 PercentileExcel ,它几乎是 commons 方法中百分位数的副本,除了如何计算位置的一个小变化:

pos=(1+p*(n-1))/100;

然后您需要将此行添加到代码中才能使用百分位数的新类:

setPercentileImpl(PercentileExcel);

The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:

pos=(1+p*(n-1))/100;

Then you need to add this line to the code in order to use the new class for percentile:

setPercentileImpl(PercentileExcel);
差↓一点笑了 2024-11-12 19:29:27

org.apache.commons.math3.stat.descriptive.rank.Percentile 类已经支持 Excel 样式插值,您只需使用 EstimationType.R_7 启用它

public class PercentileExcel extends Percentile {
    public PercentileExcel() throws MathIllegalArgumentException {

    super(50.0,
          EstimationType.R_7, // use excel style interpolation
          NaNStrategy.REMOVED,
          new KthSelector(new MedianOf3PivotingStrategy()));
    }
}

Class org.apache.commons.math3.stat.descriptive.rank.Percentile already supports Excel style interpolation, you just need to enable it with EstimationType.R_7

public class PercentileExcel extends Percentile {
    public PercentileExcel() throws MathIllegalArgumentException {

    super(50.0,
          EstimationType.R_7, // use excel style interpolation
          NaNStrategy.REMOVED,
          new KthSelector(new MedianOf3PivotingStrategy()));
    }
}
始终不够 2024-11-12 19:29:27

从数据集计算得出的百分位数没有唯一的定义。有关最常用的定义,请参阅维基百科页面

There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.

杀お生予夺 2024-11-12 19:29:27

以下无需新类的替代方案适用于 3.6:

DescriptiveStatistics ds = new DescriptiveStatistics();
Percentile p = new Percentile(50.0).withEstimationType(EstimationType.R_7)
                .withNaNStrategy(NaNStrategy.REMOVED)
                .withKthSelector(new KthSelector(new 
                  MedianOf3PivotingStrategy()))
ds.setPercentileImpl(p);

The following alternative with no new class needed works in 3.6:

DescriptiveStatistics ds = new DescriptiveStatistics();
Percentile p = new Percentile(50.0).withEstimationType(EstimationType.R_7)
                .withNaNStrategy(NaNStrategy.REMOVED)
                .withKthSelector(new KthSelector(new 
                  MedianOf3PivotingStrategy()))
ds.setPercentileImpl(p);
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文