对大量数据进行分位数计数的增量方法
我需要计算大量数据的分位数。
假设我们只能通过某些部分(即大矩阵的一行)获取数据。要计算 Q3 分位数,需要获取数据的所有部分并将其存储在某处,然后对其进行排序并计算分位数:
List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach(var row in matrix)
{
allData.AddRange(row);
}
allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];
我想找到一种无需将数据存储在中间变量中即可获取分位数的方法。最好的解决方案是计算第一行中间结果的一些参数,然后逐步调整下一行。
注意:
- 这些数据集非常大(每行约 5000 个元素),
- 可以估计 Q3,它不必是精确值。
- 我将数据部分称为“行”,但它们可以具有不同的长度!通常它变化不大(+/-几百个样本),但它变化!
这个问题类似于“在线”(迭代)用于估计统计中位数、众数、偏度、峰度的算法,但我需要计算分位数。
此主题中也有几篇文章,即:
在尝试之前实施这些方法,我想知道是否还有其他更快的方法来计算 0.25/0.75 分位数?
I need to count the quantiles for a large set of data.
Let's assume we can get the data only through some portions (i.e. one row of a large matrix). To count the Q3 quantile one need to get all the portions of the data and store it somewhere, then sort it and count the quantile:
List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach(var row in matrix)
{
allData.AddRange(row);
}
allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];
I would like to find a way of obtaining the quantile without storing the data in an intermediate variable. The best solution would be to count some parameters of mid-results for first row and then adjust it step by step for next rows.
Note:
- These datasets are really big (ca 5000 elements in each row)
- The Q3 can be estimated, it doesn't have to be an exact value.
- I call the portions of data "rows", but they can have different leghts! Usually it varies not so much (+/- few hundred samples) but it varies!
This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but I need to count quantiles.
ALso there are few articles in this topic, i.e.:
- An Efficient Algorithm for the Approximate Median Selection Problem
- Incremental quantile estimation for massive tracking
Before trying to implement these approaches, I wondered if there are maybe any other, quicker ways of counting the 0.25/0.75 quantiles?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
我赞同使用水桶的想法。不要将自己限制在 100 个桶 - 还不如使用 100 万个桶。棘手的部分是选择你的存储桶范围,这样所有的东西都不会最终集中在一个存储桶中。估计存储桶范围的最佳方法可能是对数据进行合理的随机采样,使用简单的排序算法计算 10% 和 90% 分位数,然后生成相同大小的存储桶来填充该范围。它并不完美,但如果您的数据不是来自超级奇怪的分布,它应该可以工作。
如果你不能做随机样本,你就会遇到更多麻烦。您可以根据预期的数据分布选择初始分桶猜测,然后在处理数据时,如果任何存储桶(通常是第一个或最后一个存储桶)过满,则使用新的存储桶范围重新开始。
I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.
If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
有一种更新且更简单的算法可以提供非常好的极端分位数估计。
基本思想是在极端情况下使用较小的 bin,既限制数据结构的大小,又保证小或大 q 的更高准确性。该算法有多种语言和多种软件包可供使用。 MergingDigest 版本不需要动态分配...一旦实例化 MergingDigest,就不需要进一步的堆分配。
请参阅 https://github.com/tdunning/t-digest
There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.
The basic idea is that smaller bins are used at the extremes in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest version requires no dynamic allocation ... once the MergingDigest is instantiated, no further heap allocation is required.
See https://github.com/tdunning/t-digest
如果您的数据服从高斯分布,您可以根据标准差估计分位数。我假设你的数据不是高斯分布的,或者你只是使用 SD。
如果您可以两次传递数据,我会执行以下操作:
这应该为您提供一个非常好的线性时间算法,适用于大多数不完全反常的数据集。
If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.
If you can pass through your data twice, I'd do the following:
This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data.
受到这个答案的启发我创建了一种可以很好地估计分位数的方法。对于我的目的来说,它是足够接近的近似值。
这个想法如下:0.75 分位数实际上是高于全局中位数的所有值的中位数。 0.25 分位数分别是低于全球中位数的所有值的中位数。
因此,如果我们可以近似中位数,我们就可以以类似的方式近似分位数。
备注:
eta
来适应奇怪的数据。但准确度会差一些。eta
参数:在开始时将eta
设置为几乎等于某个大值(即0.2)。随着循环的进行,降低 eta 的值,这样当您几乎到达集合末尾时,eta 将几乎等于 0(例如,在循环中计算它像这样:eta = 0.2 - 0.2*(i/N);
Inspired by this answer I created a method that estimates the quantiles quite good. It is approximation close enough for my purposes.
The idea is following: the 0.75 quantile is in fact a median of all values that lies above the global median. And respectively, 0.25 quantile is a median of all values below the global median.
So if we can approximate the median, we can in similar way approximate the quantiles.
Remarks:
eta
in order to fit to the strange data. But the accuracy will be worse.eta
parameter in this way: at the beggining set theeta
to be almost equal some large value (i.e. 0.2). As the loop passes, lower the value ofeta
so when you reach almost the end of the collection, theeta
will be almost equal 0 (for example, in loop compute it like that:eta = 0.2 - 0.2*(i/N);
q-digest 是一种近似在线算法,可让您计算分位数: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
这是一个实现:
https://github.com/airlift/airlift/blob/master/stats/src/main/ java/io/airlift/stats/QuantileDigest.java
q-digest is an approximate online algorithm that lets you compute quantile: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
Here is an implementation:
https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java