An incremental way of computing quantiles for a large set of data

Posted 2024-09-01 18:43:02


I need to compute quantiles for a large set of data.

Let's assume we can get the data only in portions (i.e. one row of a large matrix at a time). To compute the Q3 quantile, one needs to get all the portions of the data and store them somewhere, then sort everything and compute the quantile:

List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix.
foreach (var row in matrix)
{
    allData.AddRange(row);
}

allData.Sort();
// Nearest-rank method: Q3 is the element whose 1-based rank is ceil(0.75 * N).
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

I would like to find a way of obtaining the quantile without storing all the data in an intermediate variable. The best solution would be to compute some intermediate parameters from the first row and then adjust them step by step for each following row.

Note:

  • These datasets are really big (ca. 5000 elements in each row).
  • Q3 can be estimated; it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the length doesn't vary much (+/- a few hundred samples), but it does vary!

This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but I need to compute quantiles.

There are also a few articles on this topic, e.g.:

Before trying to implement these approaches, I wondered whether there are perhaps other, quicker ways of computing the 0.25/0.75 quantiles?


待"谢繁草 2024-09-08 18:43:02


I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
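
To make this concrete, here is a minimal C# sketch of the bucketing idea, not code from this answer; the class name and the use of bucket midpoints as estimates are my own choices, and the [lo, hi] range is assumed to come from a random sample as described above:

using System;

class BucketQuantileEstimator
{
    private readonly long[] counts;
    private readonly double lo;
    private readonly double width;
    private long total;

    // lo/hi would come from a sample, e.g. its 10% and 90% quantiles.
    public BucketQuantileEstimator(double lo, double hi, int bucketCount)
    {
        this.lo = lo;
        counts = new long[bucketCount];
        width = (hi - lo) / bucketCount;
    }

    public void Add(double x)
    {
        int i = (int)((x - lo) / width);
        // Clamp outliers into the first/last bucket.
        i = Math.Max(0, Math.Min(counts.Length - 1, i));
        counts[i]++;
        total++;
    }

    // Walk the histogram until the cumulative count reaches q * total,
    // and return the midpoint of the bucket where that happens.
    public double Quantile(double q)
    {
        long target = (long)Math.Ceiling(q * total);
        long cumulative = 0;
        for (int i = 0; i < counts.Length; i++)
        {
            cumulative += counts[i];
            if (cumulative >= target)
                return lo + (i + 0.5) * width;
        }
        return lo + counts.Length * width;
    }
}

Each row of data is then just a sequence of Add calls, and Quantile(0.25) / Quantile(0.75) can be read off at any point.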

又爬满兰若 2024-09-08 18:43:02


There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.

The basic idea is that smaller bins are used at the extremes in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest version requires no dynamic allocation ... once the MergingDigest is instantiated, no further heap allocation is required.

See https://github.com/tdunning/t-digest
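
This is only a pointer to the library, but the core trick is easy to illustrate. The sketch below is mine, not code from t-digest; it uses the arcsine-based scale function from the t-digest paper: equal steps in k(q) = asin(2q - 1) produce cluster boundaries in quantile space that are dense near q = 0 and q = 1 and sparse in the middle, which is why the extreme quantiles come out more accurate.

using System;

static class TDigestScaleDemo
{
    // Boundaries in quantile space for the given number of clusters.
    // Equal steps in k(q) = asin(2q - 1) map back to q = (sin(k) + 1) / 2,
    // giving narrow clusters at the extremes and wide ones in the middle.
    static double[] ClusterBoundaries(int clusters)
    {
        var q = new double[clusters + 1];
        for (int i = 0; i <= clusters; i++)
        {
            double k = -Math.PI / 2 + Math.PI * i / clusters;
            q[i] = (Math.Sin(k) + 1) / 2;
        }
        return q;
    }

    static void Main()
    {
        // With 10 clusters, the outermost buckets cover only ~2.4% of the
        // data each, while the middle ones cover ~15% each.
        foreach (var q in ClusterBoundaries(10))
            Console.WriteLine(q.ToString("F4"));
    }
}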

伪装你 2024-09-08 18:43:02
  1. Only retrieve the data you really need -- i.e., whatever value(s) is/are being used as the key for sorting, not everything else associated with it.
  2. You can probably use Tony Hoare's Select algorithm to find your quantile more quickly than sorting all the data (a sketch follows below).
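
Here is a standard textbook formulation of Hoare's selection (quickselect) in C#, not code from this answer; it finds the k-th smallest element in expected O(n) time, though unlike the streaming approaches it still needs all the data in memory at once:

using System;

static class QuickSelect
{
    // Returns the k-th smallest element (0-based) in expected O(n) time.
    // The array is partially reordered in place.
    public static double Select(double[] data, int k)
    {
        int left = 0, right = data.Length - 1;
        var rng = new Random();
        while (left < right)
        {
            int pivotIndex = Partition(data, left, right, left + rng.Next(right - left + 1));
            if (k == pivotIndex) return data[k];
            if (k < pivotIndex) right = pivotIndex - 1;
            else left = pivotIndex + 1;
        }
        return data[left];
    }

    // Lomuto partition: values smaller than the pivot end up to its left.
    private static int Partition(double[] data, int left, int right, int pivotIndex)
    {
        double pivot = data[pivotIndex];
        (data[pivotIndex], data[right]) = (data[right], data[pivotIndex]);
        int store = left;
        for (int i = left; i < right; i++)
        {
            if (data[i] < pivot)
            {
                (data[store], data[i]) = (data[i], data[store]);
                store++;
            }
        }
        (data[store], data[right]) = (data[right], data[store]);
        return store;
    }
}

For Q3 this would be Select(allData, (int)Math.Ceiling(0.75 * allData.Length) - 1), matching the index used in the question.
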
挖个坑埋了你 2024-09-08 18:43:02


If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.

If you can pass through your data twice, I'd do the following:

  • First pass, compute the max, min, SD and mean.
  • Second pass, divide the range [min, max] into some number of buckets (e.g. 100); do the same for (mean - 2*SD, mean + 2*SD) (with extra buckets for outliers). Then run through the data again, tossing numbers into these buckets.
  • Count buckets until you are at 25% and 75% of the data. If you want to get extra-fancy, you can interpolate between bucket values. (I.e. if you need 10% of a bucket to hit your 25th quantile, assume the value is 10% of the way from the low bound to the upper bound.)

This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data.
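
Here is a sketch of this two-pass scheme in C#, under the usual hedges: the method name is mine, and only min/max are used to fix the bucket range (the mean ± 2*SD refinement would slot in the same way):

using System;
using System.Collections.Generic;

static class TwoPassQuantile
{
    // rows must be enumerable twice (e.g. re-readable from disk).
    public static double Estimate(IEnumerable<IReadOnlyList<double>> rows,
                                  double q, int bucketCount = 100)
    {
        // First pass: find the overall range.
        double min = double.PositiveInfinity, max = double.NegativeInfinity;
        foreach (var row in rows)
            foreach (var x in row)
            {
                if (x < min) min = x;
                if (x > max) max = x;
            }

        // Second pass: histogram over [min, max].
        var counts = new long[bucketCount];
        long total = 0;
        double width = (max - min) / bucketCount;
        if (width == 0) return min; // all values identical
        foreach (var row in rows)
            foreach (var x in row)
            {
                counts[Math.Min(bucketCount - 1, (int)((x - min) / width))]++;
                total++;
            }

        // Count buckets until q of the data is covered, interpolating
        // linearly inside the bucket where the target falls.
        double target = q * total;
        long cumulative = 0;
        for (int i = 0; i < bucketCount; i++)
        {
            if (cumulative + counts[i] >= target)
            {
                double fraction = counts[i] == 0 ? 0.0
                    : (target - cumulative) / counts[i];
                return min + (i + fraction) * width;
            }
            cumulative += counts[i];
        }
        return max;
    }
}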

暖阳 2024-09-08 18:43:02


Inspired by this answer, I created a method that estimates the quantiles quite well. The approximation is close enough for my purposes.

The idea is the following: the 0.75 quantile is in fact the median of all values that lie above the global median, and respectively the 0.25 quantile is the median of all values that lie below it.

So if we can approximate the median, we can approximate the quantiles in a similar way.

double median = 0;
double q1 = 0;
double q3 = 0;
double eta = 0.005;

// First pass: stochastic approximation of the median. Each sample
// nudges the estimate up or down by the fixed step eta.
foreach (var value in listOfValues) // or stream, or any other large set of data...
{
    median += eta * Math.Sign(value - median);
}

// Second pass. We know the median, so we can estimate the quartiles:
// q1 is the median of the lower half, q3 the median of the upper half.
foreach (var value in listOfValues)
{
    if (value < median)
        q1 += eta * Math.Sign(value - q1);
    else
        q3 += eta * Math.Sign(value - q3);
}

Remarks:

  • If the distribution of your data is strange, you will need a bigger eta in order to fit the strange data, but the accuracy will be worse.
  • If the distribution is strange but you know the total size of your collection (i.e. N), you can adjust the eta parameter as follows: set eta to some fairly large value at the beginning (e.g. 0.2), and lower it as the loop progresses, so that near the end of the collection eta is almost 0 (for example, compute it in the loop as eta = 0.2 - 0.2*(i/N)). A sketch of this schedule follows below.