计算平均置信区间而不存储所有数据点
对于大的n
(请参阅下文了解如何确定足够大的内容),根据中心极限定理,可以安全地将样本均值的分布视为正态(高斯),但我想要一个给出任意n
的置信区间的过程。 实现此目的的方法是使用具有 n-1
自由度的 Student T 分布。
所以问题是,给定您收集或一次遇到一个数据点流,如何计算 c
(例如 c=.95
)置信度数据点平均值的间隔(不存储所有先前遇到的数据)?
提出这个问题的另一种方式是:如何在不存储整个流的情况下跟踪数据流的第一时刻和第二时刻?
额外问题:您可以在不存储整个流的情况下跟踪更高的时刻吗?
For large n
(see below for how to determine what's large enough), it's safe to treat, by the central limit theorem, the distribution of the sample mean as normal (gaussian) but I'd like a procedure that gives a confidence interval for any n
. The way to do that is to use a Student T distribution with n-1
degrees of freedom.
So the question is, given a stream of data points that you collect or encounter one at a time, how do you compute a c
(eg, c=.95
) confidence interval on the mean of the data points (without storing all of the previously encountered data)?
Another way to ask this is: How do you keep track of the first and second moments for a stream of data without storing the whole stream?
BONUS QUESTION: Can you keep track of higher moments without storing the whole stream?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
这是一篇关于如何在一次传递中计算平均值和标准差的文章,不存储任何数据数据。 一旦获得这两个统计数据,您就可以估计置信区间。 假设数据呈正态分布且存在大量数据点,95% 的置信区间将为
[mean - 1.96*stdev,mean + 1.96*stdev]
。对于较少数量的数据点,您的置信区间将为
[mean - c(n)*stdev,mean + c(n)*stdev]
,其中c(n)
取决于您的样本量和置信度。 对于 95% 的置信水平,以下是n
= 2, 3, 4, ..., 30 时的c(n)
值这些数字是
g(0.025, n-1)
,其中g
是具有 n 个自由度的 t 分布的逆 CDF。 如果您想要 99% 的置信区间,请将 0.025 替换为 0.005。 一般来说,对于1-alpha
的置信水平,请使用alpha/2
。这是生成上述常量的 R 命令。
这是博客文章解释了为什么上面的数字不像您预期的那么接近 1.96。
Here's an article on how to compute the mean and standard deviation in a single pass, not storing any data. Once you have the these two statistics, you can estimate a confidence interval. A 95% confidence interval would be
[mean - 1.96*stdev, mean + 1.96*stdev]
, assuming a normal distribution for your data and a large number of data points.For a smaller number of data points, your confidence interval would be
[mean - c(n)*stdev, mean + c(n)*stdev]
wherec(n)
depends on your sample size and your confidence level. For a 95% confidence level, here are your values ofc(n)
forn
= 2, 3, 4, ..., 30These numbers are
g(0.025, n-1)
whereg
is the inverse CDF of the t-distribution with n degrees of freedom. If you wanted a 99% confidence interval, replace 0.025 with 0.005. In general, for a confidence level of1-alpha
, usealpha/2
.Here's the R command that generated the constants above.
Here's a blog post explaining why the numbers above are not as close to 1.96 as you might expect.
[非常感谢约翰·D·库克(John D Cook)在整理这个答案时学到的很多东西!]
首先,这是不使用平方和的原因:http://www.johndcook.com/blog/2008/09/26/
您应该做什么:
跟踪计数 (n )、平均值 (u) 以及可以确定样本方差和标准误差的数量 (s)。 (改编自http://www.johndcook.com/standard_deviation.html。)
初始化<代码>n = u = s = 0。
对于每个新数据点,
x
:样本方差为
s/(n-1)
,样本均值的方差为s/(n-1) )/n
,样本均值的标准误为SE = sqrt(s/(n-1)/n)
。剩下的就是计算 Student-t
c
置信区间((0,1) 中的c
):其中
g
是逆 cdf (均值 0、方差 1 的 Student-t 分布(又名分位数),采用概率和自由度(比数据点数量少 1):其中 irib 是逆正则化不完全 beta函数:
其中
rib
是正则化不完全 beta 函数:其中
B(a,b)
是 Euler beta 函数,B(x0,x1,a,b )
是不完整的 beta 函数:典型的数值/统计库将具有 beta 函数的实现(或直接 Student-t 分布的逆 cdf)。 对于 C 语言来说,事实上的标准是 Gnu Scientific Library (GSL)。 通常会给出 beta 函数的 3 参数版本; 4 个参数的概括如下:
最后,这是 Mathematica 中的一个实现:
比较
当整个数据点列表可用时,
或最后,如果您不想加载 beta 函数等数学库,您可以硬编码
-g((1-c)/2, n-1)
的查找表。 这是c=.95
和n=2..100
:12.706204736174698、4.302652729749464、3.182446305283708、2.7764451051977934、
2.570581835636314、2.4469118511449666、2.3646242515927853、2.306004135204168、
2.262157162798205、2.2281388519862735、2.2009851600916384、2.178812829667226、
2.1603686564627917、2.1447866879178012、2.131449545559774、2.1199052992212533、
2.1098155778333156、2.100922040241039、2.093024054408307、2.0859634472658626、
2.0796138447276835、2.073873067904019、2.0686576104190477、2.0638985616280254、
2.0595385527532963、2.05552943864287、2.051830516480281、2.048407141795243、
2.0452296421327034、2.042272456301236、2.039513446396408、2.0369333434600976、
2.0345152974493392、2.032244509317719、2.030107928250338、2.0280940009804462、
2.0261924630291066、2.024394163911966、2.022690920036762、2.0210753903062715、
2.0195409704413745、2.018081702818439、2.016692199227822、2.0153675744437627、
2.0141033888808457、2.0128955989194246、2.011740513729764、2.0106347576242314、
2.0095752371292335、2.0085591121007527、2.007583770315835、2.0066468050616857、
2.005745995317864、2.0048792881880577、2.004044783289136、2.0032407188478696、
2.002465459291016、2.001717484145232、2.000995378088259、2.0002978220142578、
1.9996235849949402、1.998971517033376、1.9983405425207483、1.997729654317692、
1.9971379083920013、1.9965644189523084、1.996008354025304、1.9954689314298386、
1.994945415107228、1.9944371117711894、1.9939433678456229、1.993463566661884、
1.9929971258898527、1.9925434951809258、1.992102154002232、1.9916726096446793、
1.9912543953883763、1.9908470688116922、1.9904502102301198、1.990063421254452、
1.989686323456895、1.9893185571365664、1.9889597801751728、1.9886096669757192、
1.9882679074772156、1.9879342062390228、1.9876082815890748、1.9872898648311672、
1.9869786995062702、1.986674540703777、1.986377154418625、1.9860863169510985、
1.9858018143458114、1.9855234418666061、1.9852510035054973、1.9849843115224508、
1.9847231860139618、1.98446745450849、1.9842169515863888
渐近逼近
c=.95
的正态 (0,1) 分布的逆 CDF,即:参见 http://mathworld.wolfram.com/InverseErf.html 用于反
erf()
函数。请注意,在至少有 474 个数据点之前,
g((1-.95)/2,n-1)
不会舍入为 1.96。 当有 29 个数据点时,四舍五入为 2.0。根据经验,您应该使用 Student-t 而不是
n
的正常近似值,最多至少为 300,而不是按照传统观点为 30。 比照。 http://www.johndcook.com/blog/2008/11/12/< /a>.另请参阅康奈尔大学 Ping Li 的“改进压缩计数”。
[Huge thanks to John D Cook for much of what I learned in putting together this answer!]
First, here's the reason not to use sum-of-squares: http://www.johndcook.com/blog/2008/09/26/
What you should do instead:
Keep track of the count (n), the mean (u), and a quantity (s) from which sample variance and standard error can be determined. (Adapted from http://www.johndcook.com/standard_deviation.html.)
Initialize
n = u = s = 0
.For each new datapoint,
x
:The sample variance is then
s/(n-1)
, the variance of the sample mean iss/(n-1)/n
, and the standard error of the sample mean isSE = sqrt(s/(n-1)/n)
.It remains to compute the Student-t
c
-confidence interval (c
in (0,1)):where
g
is the inverse cdf (aka quantile) of the Student-t distribution with mean 0 and variance 1, taking a probability and the degrees of freedom (one less than the number of data points):where
irib
is the inverse regularized incomplete beta function:where
rib
is the regularized incomplete beta function:where
B(a,b)
is the Euler beta function andB(x0,x1,a,b)
is the incomplete beta function:Typical numerical/statistics libraries will have implementations of the beta function (or the inverse cdf of the Student-t distribution directly). For C, the de facto standard is the Gnu Scientific Library (GSL). Often a 3-argument version of the beta function is given; the generalization to 4 arguments is as follows:
Finally, here is an implementation in Mathematica:
Compare to
or, when the entire list of data points is available,
Finally, if you don't want to load math libraries for things like the beta function, you can hardcode a lookup table for
-g((1-c)/2, n-1)
. Here it is forc=.95
andn=2..100
:12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934,
2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168,
2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226,
2.1603686564627917, 2.1447866879178012, 2.131449545559774, 2.1199052992212533,
2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626,
2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254,
2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243,
2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976,
2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462,
2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715,
2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627,
2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314,
2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857,
2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696,
2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578,
1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692,
1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386,
1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.993463566661884,
1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793,
1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452,
1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192,
1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672,
1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985,
1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508,
1.9847231860139618, 1.98446745450849, 1.9842169515863888
which is asymptotically approaching the inverse CDF of a normal(0,1) distribution for
c=.95
, which is:See http://mathworld.wolfram.com/InverseErf.html for the inverse
erf()
function.Notice that
g((1-.95)/2,n-1)
doesn't round to 1.96 until there are at least 474 data points. It rounds to 2.0 when there are 29 data points.As a rule of thumb, you should use Student-t instead of the normal approximation for
n
up to at least 300, not 30 per conventional wisdom. Cf. http://www.johndcook.com/blog/2008/11/12/.See also "Improving Compressed Counting" by Ping Li of Cornell.
其中 t(x, n-1) 是自由度为 n-1 的 t 分布。
如果您使用 gsl
则无需存储超出平方和的任何数据。 现在,您可能会遇到数字问题,因为平方和可能非常大。 您可以使用 s 的替代定义
并进行两次传递。 我鼓励您考虑使用 gnu 科学库 或像 R 这样的包可以帮助您避免数字问题。 另外,请小心使用中心极限定理。 滥用它是目前正在发生的整个金融危机的部分原因。
Where t(x, n-1) is the t- distribution with n-1 degrees of freedom.
if you are using gsl
There's no need to store any data beyond the sum of squares. Now, you might have a numerical issue because the sum-of-squares can be quite large. You could use the alternate definition of s
and make two passes. I would encourage you to consider using gnu scientific library or a package like R to help you avoid numerical issues. Also, be careful about your use of the central limit theorem. Abuse of it is partially to blame for the whole financial crisis going on right now.
您不想累积平方和。 得到的统计数据在数字上不准确——您最终会减去两个相似的大数字。 您想要保持方差,或 (n-1)*方差,或类似的东西。
最直接的方法是增量累积数据点。 该公式并不复杂或难以推导(请参阅 John D. Cook 的链接)。
更准确的方法是将数据点成对递归地组合起来。 您可以使用 n 中的内存对数来做到这一点:寄存器 k 保存 2^k 个较旧数据点的统计数据,这些数据与 2^k 个较新数据点的统计数据相结合,以获得 2^(k+1) 个点的统计数据...
You don't want to accumulate the sum-of-squares. The resulting statistics are numerically inaccurate -- you'll end up subtracting two large, similar numbers. You want to maintain the variance, or (n-1)*variance, or something like that.
The straightforward way is to accumulate the datapoints incrementally. The formula is not complicated or hard to derive (see John D. Cook's link).
An even more accurate way to do it is to combine the datapoints pairwise-recursively. You can do this with memory logarithmic in n: register k holds statistics for 2^k older datapoints, which are combined with statistics for 2^k newer points to get statistics for 2^(k+1) points...
我认为你不必太担心
n
的大小,因为它很快就会超过30,此时可以认为分布是正常的。 如果您不想存储以前样本中的任何数据点,我认为使用贝叶斯递归对总体均值和方差参数进行后验推断(假设有一个正态模型)是最好的方法。您可以查看此文档进行联合推理平均值和方差,特别是方程 38a、38b 和 38c。
I think that you don't have to worry so much about the size of
n
because it will soon exceed the number of 30, where the distribution can be considered as normal. Using Bayesian recursion to make posterior inference on the population mean and variance parameters, assuming a normal model, is I think the best way, if you don't want to store any data points from previous samples.You can take a look at this document for joint inference for the mean and variance, and specifically equations 38a, 38b and 38c.
我想你可以。 我必须通过 Google/Wikipidia 才能找到它,所以我将其作为练习留给读者。
I think you can. I'd have to Google/Wikipidia for it so I'll leave that as an exercise for the reader.