Fast algorithm for repeated calculation of percentile?
In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:
1. Get value x
2. Insert x in an already sorted array at the back
3. Swap x down until the array is sorted
4. Read the element at position array[array.size * 3/4]
Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially as the array gets larger. Is there any way to optimize this?
UPDATE
Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    template<class T>
    class IterativePercentile {
    public:
        /// Percentile has to be in range [0, 1)
        IterativePercentile(double percentile)
            : _percentile(percentile)
        { }

        /// Adds a number in O(log(n))
        void add(const T& x) {
            // _lower is a max-heap (smallest values), _upper is a min-heap (largest values)
            if (_lower.empty() || x <= _lower.front()) {
                _lower.push_back(x);
                std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
            } else {
                _upper.push_back(x);
                std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
            }

            // Target size of the lower heap for the current number of elements
            const std::size_t size_lower =
                static_cast<std::size_t>((_lower.size() + _upper.size()) * _percentile) + 1;
            if (_lower.size() > size_lower) {
                // lower to upper
                std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
                _upper.push_back(_lower.back());
                std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
                _lower.pop_back();
            } else if (_lower.size() < size_lower) {
                // upper to lower
                std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
                _lower.push_back(_upper.back());
                std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
                _upper.pop_back();
            }
        }

        /// Access the percentile in O(1); at least one value must have been added
        const T& get() const {
            return _lower.front();
        }

        void clear() {
            _lower.clear();
            _upper.clear();
        }

    private:
        double _percentile;
        std::vector<T> _lower;
        std::vector<T> _upper;
    };
You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(log n) time complexity, and heaps are also included in the standard libraries of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). The first one has its biggest element on top, the second one its smallest.

1. Adding an element. Check whether the new element x is <= max(A). If it is, add it to heap A, otherwise to heap B. Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)). The case where heap B becomes too big is handled similarly.

2. Finding the 75th percentile. Just take the largest element from A (or the smallest from B). This requires O(log n) or O(1) time, depending on the heap implementation.

Edit: As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).
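The sizing rule from the edit note can be sketched as follows. This is only a minimal illustration, assuming int values and the fixed 3/4 fraction; the class name RunningP75 is hypothetical, and it uses std::priority_queue rather than raw heap functions. It keeps size(A) = floor(n * 3/4), so min(B) should always match array[array.size * 3/4] of the sorted data.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: tracks the element at sorted index floor(n * 3 / 4).
class RunningP75 {
public:
    // O(log n): route x to a heap, then move at most one element across.
    void add(int x) {
        if (!low_.empty() && x <= low_.top()) {
            low_.push(x);
        } else {
            high_.push(x);
        }
        // Restore the invariant size(A) == floor(n * 3 / 4).
        const std::size_t n = low_.size() + high_.size();
        const std::size_t want_low = n * 3 / 4;
        while (low_.size() > want_low) {
            high_.push(low_.top());
            low_.pop();
        }
        while (low_.size() < want_low) {
            low_.push(high_.top());
            high_.pop();
        }
    }

    // O(1): min(B) is the element at sorted index floor(n * 3 / 4).
    int get() const {
        assert(!high_.empty());  // at least one value must have been added
        return high_.top();
    }

private:
    std::priority_queue<int> low_;  // heap A: max-heap of the smallest values
    std::priority_queue<int, std::vector<int>, std::greater<int>> high_;  // heap B: min-heap
};
```

Because n grows by one per call, each loop body runs at most once, so add stays O(log n).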
A simple order statistic tree is enough for this.
A balanced version of this tree supports O(log n) insert/delete and access by rank. So you get not only the 75th percentile, but also the 66th or the 50th or whatever you need, without having to change your code.
If you read the 75th percentile frequently but insert less often, you can cache the 75th percentile element and refresh it on each insert/delete operation.
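Note that standard library trees such as Java's TreeMap are balanced search trees but do not expose O(log n) rank queries, so the tree usually has to be augmented with subtree sizes to act as an order statistic tree.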
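As a rough illustration of rank access, here is one possible order statistic tree, assuming int keys. It is a treap (a randomized search tree, balanced only in expectation) in which every node stores its subtree size; the class OrderStatisticTree and its methods are hypothetical names for this sketch, not a standard library API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>

// Hypothetical minimal order statistic tree: a treap with subtree sizes.
class OrderStatisticTree {
    struct Node {
        int key;
        unsigned priority;
        std::size_t size = 1;  // number of nodes in this subtree
        std::unique_ptr<Node> left, right;
        Node(int k, unsigned p) : key(k), priority(p) {}
    };
    using Ptr = std::unique_ptr<Node>;

    Ptr root_;
    std::mt19937 rng_{12345};  // fixed seed, chosen arbitrarily for the sketch

    static std::size_t count(const Ptr& n) { return n ? n->size : 0; }
    static void update(Ptr& n) { n->size = 1 + count(n->left) + count(n->right); }

    // Splits t into lo (keys <= key) and hi (keys > key).
    static void split(Ptr t, int key, Ptr& lo, Ptr& hi) {
        if (!t) { lo.reset(); hi.reset(); return; }
        if (t->key <= key) {
            split(std::move(t->right), key, t->right, hi);
            update(t);
            lo = std::move(t);
        } else {
            split(std::move(t->left), key, lo, t->left);
            update(t);
            hi = std::move(t);
        }
    }

    // Merges two treaps where every key in a is <= every key in b.
    static Ptr merge(Ptr a, Ptr b) {
        if (!a) return b;
        if (!b) return a;
        if (a->priority > b->priority) {
            a->right = merge(std::move(a->right), std::move(b));
            update(a);
            return a;
        }
        b->left = merge(std::move(a), std::move(b->left));
        update(b);
        return b;
    }

public:
    // Expected O(log n) insertion; duplicates are allowed.
    void insert(int key) {
        Ptr lo, hi;
        split(std::move(root_), key, lo, hi);
        Ptr node(new Node(key, static_cast<unsigned>(rng_())));
        root_ = merge(merge(std::move(lo), std::move(node)), std::move(hi));
    }

    std::size_t size() const { return count(root_); }

    // Returns the element at 0-based rank k in sorted order, expected O(log n).
    int kth(std::size_t k) const {
        assert(k < count(root_));
        const Node* n = root_.get();
        while (true) {
            const std::size_t left = count(n->left);
            if (k < left) {
                n = n->left.get();
            } else if (k == left) {
                return n->key;
            } else {
                k -= left + 1;
                n = n->right.get();
            }
        }
    }
};
```

With this sketch, tree.kth(tree.size() * 3 / 4) reads the 75th percentile, and any other rank can be queried the same way.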
If you can do with an approximate answer, you can use a histogram instead of keeping all the values in memory.
For each new value, add it to the appropriate bin.
Calculate the 75th percentile by traversing the bins and summing their counts until 75% of the population size is reached. The percentile value lies between the lower and upper bounds of the bin you stopped at.
This gives O(B) complexity, where B is the number of bins, which is range_size/bin_size (use a bin_size appropriate to your use case).
I have implemented this logic in a JVM library, https://github.com/IBM/HBPE, which you can use as a reference.
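A minimal sketch of the histogram idea, assuming a value range known in advance and equal-width bins. The class HistogramPercentile is a hypothetical name for illustration and is not taken from the HBPE library; values outside the range are clamped into the first or last bin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-range histogram with equal-width bins.
class HistogramPercentile {
public:
    HistogramPercentile(double min, double max, double bin_size)
        : min_(min),
          bin_size_(bin_size),
          counts_(static_cast<std::size_t>((max - min) / bin_size) + 1, 0) {}

    // O(1): increment the count of the bin that x falls into.
    void add(double x) {
        const double offset = (x - min_) / bin_size_;
        std::size_t bin = 0;
        if (offset >= static_cast<double>(counts_.size())) {
            bin = counts_.size() - 1;  // clamp values above the range
        } else if (offset > 0) {
            bin = static_cast<std::size_t>(offset);
        }
        ++counts_[bin];
        ++total_;
    }

    // O(B): lower bound of the bin that contains the requested percentile.
    // The true value lies in [result, result + bin_size).
    double lower_bound_of(double percentile) const {
        assert(total_ > 0);
        const double target = percentile * static_cast<double>(total_);
        std::size_t seen = 0;
        for (std::size_t bin = 0; bin < counts_.size(); ++bin) {
            seen += counts_[bin];
            if (static_cast<double>(seen) > target) {
                return min_ + static_cast<double>(bin) * bin_size_;
            }
        }
        return min_ + static_cast<double>(counts_.size() - 1) * bin_size_;
    }

private:
    double min_;
    double bin_size_;
    std::vector<std::size_t> counts_;
    std::size_t total_ = 0;
};
```

A smaller bin_size tightens the answer at the cost of more bins to scan per query.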
You can use binary search to find the correct position in O(log n). However, shifting the array up is still O(n).
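A short sketch of that trade-off, assuming int values stored in a std::vector; the function names insert_sorted and percentile75 are made up for this example. The search is logarithmic, but the insertion itself still moves every later element.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inserts x into an already sorted vector, keeping it sorted.
void insert_sorted(std::vector<int>& sorted, int x) {
    // O(log n): binary search for the first element greater than x.
    auto pos = std::upper_bound(sorted.begin(), sorted.end(), x);
    // O(n): every element from pos onward shifts up by one slot.
    sorted.insert(pos, x);
}

// Reads the element at index size * 3 / 4, as in the question.
int percentile75(const std::vector<int>& sorted) {
    assert(!sorted.empty());
    return sorted[sorted.size() * 3 / 4];
}
```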
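If you have a known set of values, the following will be very fast:
Create a large array of integers (even bytes will work) with a number of elements equal to the maximum value of your data.
For example, if the maximum value of t is 100,000, create an array with one counter per possible value.
Now iterate over the entire set of values, incrementing the counter that corresponds to each value.
Now calculate the percentile by summing the counters from the lowest value upward until the running total reaches the desired fraction of the population.
You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.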
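The counting approach might look like the sketch below. It assumes non-negative integer values no larger than a known maximum; the class CountingPercentile is a hypothetical name, and it uses std::size_t counters rather than bytes so that counts cannot overflow in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: values must be integers in [0, max_value].
class CountingPercentile {
public:
    explicit CountingPercentile(std::size_t max_value)
        : counts_(max_value + 1, 0) {}

    // O(1): one counter per possible value.
    void add(std::size_t value) {
        assert(value < counts_.size());
        ++counts_[value];
        ++total_;
    }

    // O(max_value): walk the counters until the running total passes the
    // requested fraction of the population.
    std::size_t percentile(double fraction) const {
        assert(total_ > 0);
        const double target = fraction * static_cast<double>(total_);
        std::size_t seen = 0;
        for (std::size_t v = 0; v < counts_.size(); ++v) {
            seen += counts_[v];
            if (static_cast<double>(seen) > target) return v;
        }
        return counts_.size() - 1;
    }

private:
    std::vector<std::size_t> counts_;
    std::size_t total_ = 0;
};
```

Unlike a histogram with wider bins, this gives an exact value, because each counter covers a single integer.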
Here is a JavaScript solution. Copy-paste it into the browser console and it works. $scores contains the list of scores, and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 99th percentile is 87.9.