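Apache Commons Math 2.2 Percentile bug?
I'm not 100% sure whether this is a bug or whether I'm doing something wrong, but if you give Percentile a large amount of data consisting of the same value (see the code below), the evaluate method takes a very long time. If you give Percentile random values, evaluate takes considerably less time.
As noted below, Median is a subclass of Percentile.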
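    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Assumption: ArrayUtils comes from Commons Lang 2.x (org.apache.commons.lang3 in 3.x).
    import org.apache.commons.lang.ArrayUtils;
    import org.apache.commons.math.stat.descriptive.rank.Median;

    // The class name and main method are added only so the two tests can be run.
    public class PercentileTiming {

        // Slow case: 200,000 copies of the same value.
        private void testOne() {
            int size = 200000;
            int sameValue = 100;
            List<Double> list = new ArrayList<Double>();
            for (int i = 0; i < size; i++) {
                list.add((double) sameValue);
            }
            Median m = new Median();
            m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));
            long start = System.currentTimeMillis();
            System.out.println("Start:" + start);
            double result = m.evaluate();
            System.out.println("Result:" + result);
            System.out.println("Time:" + (System.currentTimeMillis() - start));
        }

        // Fast case: 200,000 random values in [0, 100).
        private void testTwo() {
            int size = 200000;
            List<Double> list = new ArrayList<Double>();
            Random r = new Random();
            for (int i = 0; i < size; i++) {
                list.add(r.nextDouble() * 100.0);
            }
            Median m = new Median();
            m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));
            long start = System.currentTimeMillis();
            System.out.println("Start:" + start);
            double result = m.evaluate();
            System.out.println("Result:" + result);
            System.out.println("Time:" + (System.currentTimeMillis() - start));
        }

        public static void main(String[] args) {
            PercentileTiming t = new PercentileTiming();
            t.testOne();
            t.testTwo();
        }
    }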
Comments (2)
This is a known issue that appeared between versions 2.0 and 2.1 and has been fixed for version 3.1.
Version 2.0 did indeed sort the data, but in 2.1 they seem to have switched to a selection algorithm. However, a bug in that implementation led to bad behavior for data with lots of identical values: basically, they used >= and <= instead of > and <.
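To illustrate why the choice of comparison operators matters, here is a minimal quickselect sketch. It is not the library's actual source: the class QuickselectDemo, its methods, and the comparison counter are all hypothetical names written for this example. With skipEquals set to true, the partition scans past elements equal to the pivot (>= and <=), so on all-identical data the pivot ends up next to the start of the range and each pass shrinks the range by only a couple of elements. With skipEquals set to false (> and <), both scans stop on equal elements and the pivot lands near the middle.

```java
import java.util.Arrays;

class QuickselectDemo {
    /** Element comparisons made by the most recent call to select(). */
    static long comparisons;

    // True if x may stay on the high side of the pivot (counts one comparison).
    static boolean high(double x, double pivot, boolean skipEquals) {
        comparisons++;
        return skipEquals ? x >= pivot : x > pivot;
    }

    // True if x may stay on the low side of the pivot (counts one comparison).
    static boolean low(double x, double pivot, boolean skipEquals) {
        comparisons++;
        return skipEquals ? x <= pivot : x < pivot;
    }

    /** Partitions work[begin, end) around its middle element; returns the pivot's final index. */
    static int partition(double[] work, int begin, int end, boolean skipEquals) {
        int pivot = begin + (end - begin) / 2;
        double value = work[pivot];
        work[pivot] = work[begin];          // park the first element in the pivot's old slot
        int i = begin + 1;
        int j = end - 1;
        while (i < j) {
            while (i < j && high(work[j], value, skipEquals)) { --j; }
            while (i < j && low(work[i], value, skipEquals)) { ++i; }
            if (i < j) {                    // both scans stopped: swap the misplaced pair
                double tmp = work[i];
                work[i++] = work[j];
                work[j--] = tmp;
            }
        }
        if (i >= end || work[i] > value) { --i; }
        work[begin] = work[i];              // move a low-side element to the front
        work[i] = value;                    // and drop the pivot into its final position
        return i;
    }

    /** Returns the k-th smallest element (0-based) of data without modifying data. */
    static double select(double[] data, int k, boolean skipEquals) {
        double[] work = data.clone();
        comparisons = 0;
        int begin = 0;
        int end = work.length;
        while (end - begin > 1) {
            int p = partition(work, begin, end, skipEquals);
            if (k == p) {
                return work[k];
            } else if (k < p) {
                end = p;
            } else {
                begin = p + 1;
            }
        }
        return work[begin];
    }

    public static void main(String[] args) {
        double[] same = new double[20000];
        Arrays.fill(same, 100.0);
        int k = same.length / 2;
        select(same, k, true);
        System.out.println("comparisons with >= and <=: " + comparisons);
        select(same, k, false);
        System.out.println("comparisons with >  and < : " + comparisons);
    }
}
```

Running main on an array of identical values should show the >= and <= variant doing far more comparisons than the > and < variant, which matches the quadratic-versus-linear behavior described above.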
It's well known that some algorithms can exhibit slower performance for certain data sets. Performance can in some cases be improved by randomizing the data set before performing the operation.
Since Percentile probably involves sorting the data, I'm guessing that your "bug" is not really a defect in the code, but rather a manifestation of one of the slower-performing data sets.
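For reference, randomizing a primitive array before processing could look like the Fisher-Yates sketch below. ShuffleDemo and shuffle are hypothetical names, not part of Commons Math. Note that shuffling only changes the order of elements, so an array in which every value is identical is left exactly as it was and this technique cannot help in that particular case.

```java
import java.util.Random;

class ShuffleDemo {
    /** Shuffles data in place using the Fisher-Yates algorithm. */
    static void shuffle(double[] data, Random rng) {
        for (int i = data.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);   // uniform index in [0, i]
            double tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

A caller would shuffle a copy of the input, for example shuffle(values.clone(), new Random()), so that the original order is preserved for other uses.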