Fast algorithm for repeated calculation of percentile?

Posted on 2024-09-19 11:29:07

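In an algorithm, I have to calculate the 75th percentile of a data set every time I add a value. Right now I am doing this:

1. Get value x
2. Insert x at the back of an already sorted array
3. Swap x down until the array is sorted
4. Read the element at position array[array.size * 3/4]

Step 3 is O(n), and the rest is O(1), but this is still quite slow, especially as the array gets larger. Is there any way to optimize this?

UPDATE

Thanks Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code: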

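#include <algorithm>   // std::push_heap, std::pop_heap
#include <functional>  // std::less, std::greater
#include <vector>

template<class T>
class IterativePercentile {
public:
  /// Percentile has to be in the range [0, 1)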
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }            
  }

  /// Access the percentile in O(1).
  /// Must not be called before the first add(): _lower would be empty.
  const T& get() const {
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};



如若梦似彩虹 2024-09-26 11:29:07

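You can do this with two heaps. I am not sure whether there is a less "contrived" solution, but this one gives O(log n) time per insertion, and heaps are included in the standard library of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements, and the other heap (heap B) contains the rest (the largest 25%). Heap A keeps its largest element on top; heap B keeps its smallest element on top.

Adding an element

See if the new element x is <= max(A). If it is, add it to heap A, otherwise add it to heap B.
Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the largest element from A (O(log n)) and add it to heap B (also O(log n)).
Do the same in the other direction if heap B became too big.

Finding the "0.75 median"

Just take the largest element from A (or the smallest from B). This requires O(log n) or O(1) time, depending on the heap implementation.

Edit

As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want an exact answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then for every n > 0, array[array.size * 3/4] = min(B).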
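
As a rough illustration of the sizing rule in the edit above, here is a minimal C++ sketch built on std::priority_queue. The class name RunningP75 and its members are made up for this example, and it hard-codes the 75th percentile over double values; treat it as one possible reading of the answer rather than a reference implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: keeps size(A) == floor(n * 3 / 4), so min(B) is the
// element a sorted array would hold at index n * 3 / 4.
class RunningP75 {
public:
    void add(double x) {
        // Route x to the heap it belongs to.
        if (!a_.empty() && x <= a_.top()) {
            a_.push(x);
        } else {
            b_.push(x);
        }
        // Restore the size invariant; at most one element has to move.
        const std::size_t n = a_.size() + b_.size();
        const std::size_t want_a = n * 3 / 4;
        if (a_.size() > want_a) {
            b_.push(a_.top());   // largest of A moves to B
            a_.pop();
        } else if (a_.size() < want_a) {
            a_.push(b_.top());   // smallest of B moves to A
            b_.pop();
        }
    }

    // O(1): the 75th percentile is the smallest element of B.
    double get() const {
        assert(!b_.empty());     // requires at least one add()
        return b_.top();
    }

private:
    std::priority_queue<double> a_;  // max-heap: smallest 75%
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> b_;  // min-heap: largest 25%
};
```

With this sizing, heap B is never empty once a value has been added, so get() only needs to look at its top.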


何止钟意 2024-09-26 11:29:07

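A simple order statistic tree is enough for this.

A balanced version of this tree supports O(log n) insertion/deletion and access by rank. So you get not only the 75th percentile, but also the 66th or 50th or whatever you need, without having to change your code.

If you access the 75th percentile frequently but insert less often, you can always cache the 75th-percentile element during insert/delete operations.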

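Note that most standard tree containers (such as Java's TreeMap) are balanced search trees but do not expose access by rank, so you will typically need an implementation that also stores subtree sizes.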
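
To make the rank lookup concrete, here is a small C++ sketch of an order statistic tree, written as a treap (a randomized balanced BST) whose nodes store their subtree size. The class OrderStatisticTree and everything inside it are hypothetical names for this illustration. It supports only insertion and lookup by rank, and the fixed RNG seed is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>

// Illustrative order statistic tree: a treap whose nodes store subtree sizes.
class OrderStatisticTree {
public:
    // Expected O(log n): split around key, then merge the new node in.
    void insert(double key) {
        Ptr lo, hi;
        split(std::move(root_), key, lo, hi);
        Ptr node = std::make_unique<Node>(key, rng_());
        root_ = merge(merge(std::move(lo), std::move(node)), std::move(hi));
    }

    std::size_t size() const { return size_of(root_); }

    // Element with 0-based rank k in sorted order, expected O(log n).
    double kth(std::size_t k) const {
        assert(k < size());
        const Node* n = root_.get();
        for (;;) {
            const std::size_t left = size_of(n->left);
            if (k < left) {
                n = n->left.get();
            } else if (k == left) {
                return n->key;
            } else {
                k -= left + 1;        // skip the left subtree and this node
                n = n->right.get();
            }
        }
    }

private:
    struct Node {
        Node(double k, std::mt19937::result_type p) : key(k), priority(p) {}
        double key;
        std::mt19937::result_type priority;  // random heap priority
        std::size_t size = 1;                // nodes in this subtree
        std::unique_ptr<Node> left, right;
    };
    using Ptr = std::unique_ptr<Node>;

    static std::size_t size_of(const Ptr& n) { return n ? n->size : 0; }
    static void update(Node& n) {
        n.size = 1 + size_of(n.left) + size_of(n.right);
    }

    // Splits t into lo (keys < key) and hi (keys >= key).
    static void split(Ptr t, double key, Ptr& lo, Ptr& hi) {
        if (!t) { lo.reset(); hi.reset(); return; }
        if (t->key < key) {
            split(std::move(t->right), key, t->right, hi);
            update(*t);
            lo = std::move(t);
        } else {
            split(std::move(t->left), key, lo, t->left);
            update(*t);
            hi = std::move(t);
        }
    }

    // Merges a and b, assuming every key in a is <= every key in b.
    static Ptr merge(Ptr a, Ptr b) {
        if (!a) return b;
        if (!b) return a;
        if (a->priority > b->priority) {
            a->right = merge(std::move(a->right), std::move(b));
            update(*a);
            return a;
        }
        b->left = merge(std::move(a), std::move(b->left));
        update(*b);
        return b;
    }

    Ptr root_;
    std::mt19937 rng_{12345};  // fixed seed, chosen arbitrarily
};
```

The 75th percentile is then kth(size() * 3 / 4), and any other percentile is the same call with a different rank.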


沧桑㈠ 2024-09-26 11:29:07

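If you can live with an approximate answer, you can use a histogram instead of keeping all the values in memory.

For each new value, add it to the appropriate bin.
Calculate the 75th percentile by traversing the bins and summing their counts until 75% of the population size is reached. The percentile value lies between the lower and upper bounds of the bin at which you stopped.

This gives O(B) complexity, where B is the number of bins, i.e. range_size/bin_size. (Use a bin_size appropriate to your use case.)

I have implemented this logic in a JVM library: https://github.com/IBM/HBPE, which you can use as a reference.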
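
The two steps above might look roughly like this in C++. This is a sketch under the assumption that all values fall in a known range [lo, hi) split into equal-width bins. The class HistogramPercentile and its methods are invented for this example and are not the API of the HBPE library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-range histogram; assumes every value lies in [lo, hi).
class HistogramPercentile {
public:
    HistogramPercentile(double lo, double hi, std::size_t bins)
        : lo_(lo), width_((hi - lo) / bins), counts_(bins, 0) {
        assert(bins > 0 && hi > lo);
    }

    // O(1): bump the count of the bin that contains x.
    void add(double x) {
        assert(x >= lo_);
        std::size_t bin = static_cast<std::size_t>((x - lo_) / width_);
        if (bin >= counts_.size()) bin = counts_.size() - 1;  // clamp top edge
        ++counts_[bin];
        ++total_;
    }

    // O(B): lower bound of the bin holding percentile p, with p in (0, 1].
    // The true value lies in [result, result + bin_width()).
    double lower_bound(double p) const {
        assert(total_ > 0);
        const double target = p * total_;
        std::size_t seen = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            seen += counts_[i];
            if (seen >= target) return lo_ + i * width_;
        }
        return lo_ + (counts_.size() - 1) * width_;
    }

    double bin_width() const { return width_; }

private:
    double lo_;
    double width_;
    std::vector<std::size_t> counts_;
    std::size_t total_ = 0;
};
```

Memory stays proportional to the number of bins regardless of how many values are added, at the cost of an error of up to one bin width.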


赢得她心 2024-09-26 11:29:07

You can use binary search to find the correct position in O(log n). However, shifting the array elements up to make room is still O(n).
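
A short C++ sketch of that trade-off, using std::upper_bound for the search and std::vector::insert for the shift. The function names insert_sorted and percentile75 are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts x into an already sorted vector.
void insert_sorted(std::vector<double>& sorted, double x) {
    // O(log n): binary search for the first element greater than x.
    auto pos = std::upper_bound(sorted.begin(), sorted.end(), x);
    // O(n): insert() shifts every later element one slot up.
    sorted.insert(pos, x);
}

// Reads the element at index size * 3 / 4, as in the question.
double percentile75(const std::vector<double>& sorted) {
    assert(!sorted.empty());
    return sorted[sorted.size() * 3 / 4];
}
```

The search gets cheaper, but the overall cost per insertion stays linear because of the shift.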


一绘本一梦想 2024-09-26 11:29:07

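If your values are non-negative integers from a known, bounded range, the following will be very fast:

Create a large array of integers (even bytes will work) with one element per possible value of your data.
For example, if the maximum value of t is 100,000, create an array

int[] index = new int[100001]; // indices 0..100,000, about 400 KB

Now iterate over the entire set of values:

for (int t : set_of_values) {
  index[t]++;
}

// You can do a try/catch on ArrayIndexOutOfBoundsException just in case :)

Now calculate the percentile (the 90th here; use 0.75 for the 75th):

int sum = 0, i = 0;
while (sum < 0.9 * set_of_values.length) {
  sum += index[i++];
}

return i - 1; // i was already advanced past the bin that reached the target

You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.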



千年*琉璃梦 2024-09-26 11:29:07

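Here is a JavaScript solution. Copy and paste it into the browser console and it works. $scores contains the list of scores, and $percentile selects the n-th percentile of the list. So the 75th percentile is 76.8 and the 90th percentile is 87.9.

function get_percentile($percentile, $array) {
    // Sort a copy numerically; the default sort() compares values as strings.
    const $sorted = $array.slice().sort((a, b) => a - b);
    // Intended for 0 < $percentile < 100.
    const $index = ($percentile / 100) * $sorted.length;
    if (Math.floor($index) === $index) {
        // Whole-number index: average the two neighbouring elements.
        return ($sorted[$index - 1] + $sorted[$index]) / 2;
    }
    return $sorted[Math.floor($index)];
}

const $scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];

get_percentile(75, $scores); // 76.8
get_percentile(90, $scores); // 87.9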

