Efficient computation/storage of an empirical CDF

Posted 2024-10-04 01:46:53

I'm trying to precompute the distributions of several random variables. Specifically, these random variables are the results of functions evaluated at positions in a genome, so there will be on the order of 10^8 or 10^9 values for each. The functions are fairly smooth, so I don't think I'll lose much accuracy by evaluating only at every 2nd, 10th, or 100th base or so, but either way there will be a large number of samples. My plan is to precompute quantile tables (perhaps percentiles) for each function and look them up in my main program, so that I don't have to recompute these distribution statistics on every run.

But I don't see how to do this easily: storing, sorting, and reducing an array of 10^9 floats isn't really feasible, and I can't think of another way that doesn't lose information about the distribution. Is there a way to measure the quantiles of a sample distribution without holding the whole sample in memory?


Comments (1)

坏尐絯 2024-10-11 01:46:53

I agree with @katriealex's comment: ask someone with a strong statistics background.

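You can easily compute the min, max, mean, and standard deviation in a single pass without storing any significant amount of data. For the mean and standard deviation, use the running update that Knuth describes (Welford's method). Start from m[0] = 0 and S[0] = 0, and for each new sample x[n] (n = 1, 2, ...) compute:

delta = x[n] - m[n-1]
m[n] = m[n-1] + delta/n
S[n] = S[n-1] + (x[n] - m[n])*delta

At any point the current estimates are:

mean = m[n]
std dev = sqrt(S[n]/n)

(Divide by n-1 instead of n if you want the sample rather than the population standard deviation.)

This avoids the floating-point problems of the naive calculation of the standard deviation, in which you take S1 = the sum of x[k] and S2 = the sum of x[k]^2 and compute std dev = sqrt(S2/N - S1^2/N^2). With that formula the sums can overflow, and, more commonly, the two terms under the square root are large and nearly equal, so their difference loses most of its precision. See also Wikipedia.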
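As a rough illustration of the running update above, here is a minimal Python sketch. The class name `RunningStats` and its method names are made up for this example, and the sample values are arbitrary stand-in data; it keeps only a handful of numbers in memory no matter how many samples are pushed.

```python
import math


class RunningStats:
    """Single-pass min/max/mean/std dev (hypothetical helper, not from the original answer)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0   # m[n] in the notation above
        self.s = 0.0      # S[n]: running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def push(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean             # x[n] - m[n-1]
        self.mean += delta / self.n       # m[n]
        self.s += (x - self.mean) * delta # S[n]
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def std_dev(self):
        """Population standard deviation, sqrt(S[n]/n); NaN if no samples yet."""
        return math.sqrt(self.s / self.n) if self.n else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in data
    stats.push(value)
```

In a real run the loop would iterate over the function values as they are generated, so nothing beyond the five fields of `stats` needs to be stored.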

There are probably other streaming algorithms for computing higher moments of the distribution, but I don't know what they are.

Alternatively, you could build a histogram with enough bins to characterize the distribution, and read approximate quantiles off the cumulative bin counts.
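The histogram idea can be sketched as follows. This is only one possible reading of it, under assumptions that are not in the original answer: the value range `[lo, hi)` is known in advance (for example from a cheap first pass that records min and max), the bins have equal width, and quantiles are estimated by linear interpolation inside a bin. `StreamingHistogram` is a hypothetical name, and the evenly spread input is stand-in data.

```python
class StreamingHistogram:
    """Fixed-width histogram over [lo, hi); out-of-range values are clamped to the edge bins."""

    def __init__(self, lo, hi, bins):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.width = (hi - lo) / bins
        self.counts = [0] * bins
        self.total = 0

    def push(self, x):
        """Count one sample; memory use depends only on the number of bins."""
        i = int((x - self.lo) / self.width)
        i = min(max(i, 0), self.bins - 1)  # clamp to a valid bin index
        self.counts[i] += 1
        self.total += 1

    def quantile(self, q):
        """Approximate the q-quantile (0 <= q <= 1) from cumulative bin counts."""
        target = q * self.total
        cumulative = 0
        for i, c in enumerate(self.counts):
            if c and cumulative + c >= target:
                frac = (target - cumulative) / c  # position inside this bin
                return self.lo + (i + frac) * self.width
            cumulative += c
        return self.hi  # only reached if the histogram is empty


hist = StreamingHistogram(0.0, 1.0, 1000)
for k in range(100_000):
    hist.push((k + 0.5) / 100_000)  # stand-in data, evenly spread over (0, 1)

median = hist.quantile(0.5)
percentiles = [hist.quantile(p / 100) for p in range(101)]  # table to store
```

The error of each estimated quantile is bounded by one bin width, so the bin count trades memory against resolution. If the distribution is heavily skewed, equal-width bins waste resolution in the tails, and bins on a transformed (for example logarithmic) scale may be a better fit.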
