当前位置：文江博客话题详情

计算平均置信区间而不存储所有数据点

发布于 2024-07-08 09:59:35 字数 357 浏览 12 评论 0原文

对于大的n（请参阅下文了解如何确定足够大的内容），根据中心极限定理，可以安全地将样本均值的分布视为正态（高斯），但我想要一个给出任意n的置信区间的过程。实现此目的的方法是使用具有 n-1 自由度的 Student T 分布。

所以问题是，给定您收集或一次遇到一个数据点流，如何计算 c（例如 c=.95）置信度数据点平均值的间隔（不存储所有先前遇到的数据）？

提出这个问题的另一种方式是：如何在不存储整个流的情况下跟踪数据流的第一时刻和第二时刻？

额外问题：您可以在不存储整个流的情况下跟踪更高的时刻吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

并安 2024-07-15 09:59:35

这是一篇关于如何在一次传递中计算平均值和标准差的文章，不存储任何数据数据。一旦获得这两个统计数据，您就可以估计置信区间。假设数据呈正态分布且存在大量数据点，95% 的置信区间将为[mean - 1.96*stdev,mean + 1.96*stdev]。

对于较少数量的数据点，您的置信区间将为 [mean - c(n)*stdev,mean + c(n)*stdev]，其中 c(n) 取决于您的样本量和置信度。对于 95% 的置信水平，以下是 n = 2, 3, 4, ..., 30 时的 c(n) 值

12.70620、4.302653、3.182446、2.776445、2.570582、2.446912、2.364624、2.306004、2.262157、2.228139、2.200985、2.178813、 2.160369、2.144787、2.131450、2.119905、2.109816、2.100922、2.093024、2.085963、2.079614、2.073873、2.068658、2.063899 , 2.059539, 2.055529, 2.051831, 2.048407, 2.045230

这些数字是 g(0.025, n-1)，其中 g 是具有 n 个自由度的 t 分布的逆 CDF。如果您想要 99% 的置信区间，请将 0.025 替换为 0.005。一般来说，对于 1-alpha 的置信水平，请使用 alpha/2。

这是生成上述常量的 R 命令。

n = seq(2, 30); qt(0.025, n-1)

这是博客文章解释了为什么上面的数字不像您预期的那么接近 1.96。

Here's an article on how to compute the mean and standard deviation in a single pass, not storing any data. Once you have the these two statistics, you can estimate a confidence interval. A 95% confidence interval would be [mean - 1.96*stdev, mean + 1.96*stdev], assuming a normal distribution for your data and a large number of data points.

For a smaller number of data points, your confidence interval would be [mean - c(n)*stdev, mean + c(n)*stdev] where c(n) depends on your sample size and your confidence level. For a 95% confidence level, here are your values of c(n) for n = 2, 3, 4, ..., 30

12.70620, 4.302653, 3.182446, 2.776445, 2.570582, 2.446912, 2.364624, 2.306004, 2.262157, 2.228139, 2.200985, 2.178813, 2.160369, 2.144787, 2.131450, 2.119905, 2.109816, 2.100922, 2.093024, 2.085963, 2.079614, 2.073873, 2.068658, 2.063899, 2.059539, 2.055529, 2.051831, 2.048407, 2.045230

These numbers are g(0.025, n-1) where g is the inverse CDF of the t-distribution with n degrees of freedom. If you wanted a 99% confidence interval, replace 0.025 with 0.005. In general, for a confidence level of 1-alpha, use alpha/2.

Here's the R command that generated the constants above.

n = seq(2, 30); qt(0.025, n-1)

Here's a blog post explaining why the numbers above are not as close to 1.96 as you might expect.

回复收藏 0 原文

浮华 2024-07-15 09:59:35

[非常感谢约翰·D·库克（John D Cook）在整理这个答案时学到的很多东西！]

首先，这是不使用平方和的原因：http://www.johndcook.com/blog/2008/09/26/

您应该做什么：

跟踪计数 (n )、平均值 (u) 以及可以确定样本方差和标准误差的数量 (s)。（改编自http://www.johndcook.com/standard_deviation.html。）

初始化<代码>n = u = s = 0。

对于每个新数据点，x：

u0 = u;
n ++;
u += (x - u) / n;
s += (x - u0) * (x - u);

样本方差为 s/(n-1)，样本均值的方差为 s/(n-1) )/n，样本均值的标准误为SE = sqrt(s/(n-1)/n)。

剩下的就是计算 Student-t c 置信区间（(0,1) 中的 c）：

u [plus or minus] SE*g((1-c)/2, n-1)

其中 g 是逆 cdf (均值 0、方差 1 的 Student-t 分布（又名分位数），采用概率和自由度（比数据点数量少 1）：

g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1/2) - 1)

其中 irib 是逆正则化不完全 beta函数：

irib(s0,s1,a,b) = z such that rib(s0,z,a,b) = s1

其中 rib 是正则化不完全 beta 函数：

rib(x0,x1,a,b) = B(x0,x1,a,b) / B(a,b)

其中 B(a,b) 是 Euler beta 函数，B(x0,x1,a,b ) 是不完整的 beta 函数：

B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) = integral_0^1 t^(a-1)*(1-t)^(b-1) dt
B(x0,x1,a,b) = integral_x0^x1 t^(a-1)*(1-t)^(b-1) dt

典型的数值/统计库将具有 beta 函数的实现（或直接 Student-t 分布的逆 cdf）。对于 C 语言来说，事实上的标准是 Gnu Scientific Library (GSL)。通常会给出 beta 函数的 3 参数版本； 4 个参数的概括如下：

B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b)
rib(x0,x1,a,b) = rib(x1,a,b) - rib(x0,a,b)

最后，这是 Mathematica 中的一个实现：

(* Take current {n,u,s} and new data point; return new {n,u,s}. *)
update[{n_,u_,s_}, x_] := {n+1, u+(x-u)/(n+1), s+(x-u)(x-(u+(x-u)/(n+1)))}

Needs["HypothesisTesting`"];
g[p_, df_] := InverseCDF[StudentTDistribution[df], p]

(* Mean CI given n,u,s and confidence level c. *)
mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]}, 
  {u+d, u-d}]

比较

StudentTCI[u, SE, n-1, ConfidenceLevel->c]

当整个数据点列表可用时，

MeanCI[list, ConfidenceLevel->c]

或最后，如果您不想加载 beta 函数等数学库，您可以硬编码 -g((1-c)/2, n-1) 的查找表。这是 c=.95 和 n=2..100：

12.706204736174698、4.302652729749464、3.182446305283708、2.7764451051977934、
2.570581835636314、2.4469118511449666、2.3646242515927853、2.306004135204168、
2.262157162798205、2.2281388519862735、2.2009851600916384、2.178812829667226、
2.1603686564627917、2.1447866879178012、2.131449545559774、2.1199052992212533、
2.1098155778333156、2.100922040241039、2.093024054408307、2.0859634472658626、
2.0796138447276835、2.073873067904019、2.0686576104190477、2.0638985616280254、
2.0595385527532963、2.05552943864287、2.051830516480281、2.048407141795243、
2.0452296421327034、2.042272456301236、2.039513446396408、2.0369333434600976、
2.0345152974493392、2.032244509317719、2.030107928250338、2.0280940009804462、
2.0261924630291066、2.024394163911966、2.022690920036762、2.0210753903062715、
2.0195409704413745、2.018081702818439、2.016692199227822、2.0153675744437627、
2.0141033888808457、2.0128955989194246、2.011740513729764、2.0106347576242314、
2.0095752371292335、2.0085591121007527、2.007583770315835、2.0066468050616857、
2.005745995317864、2.0048792881880577、2.004044783289136、2.0032407188478696、
2.002465459291016、2.001717484145232、2.000995378088259、2.0002978220142578、
1.9996235849949402、1.998971517033376、1.9983405425207483、1.997729654317692、
1.9971379083920013、1.9965644189523084、1.996008354025304、1.9954689314298386、
1.994945415107228、1.9944371117711894、1.9939433678456229、1.993463566661884、
1.9929971258898527、1.9925434951809258、1.992102154002232、1.9916726096446793、
1.9912543953883763、1.9908470688116922、1.9904502102301198、1.990063421254452、
1.989686323456895、1.9893185571365664、1.9889597801751728、1.9886096669757192、
1.9882679074772156、1.9879342062390228、1.9876082815890748、1.9872898648311672、
1.9869786995062702、1.986674540703777、1.986377154418625、1.9860863169510985、
1.9858018143458114、1.9855234418666061、1.9852510035054973、1.9849843115224508、
1.9847231860139618、1.98446745450849、1.9842169515863888

渐近逼近 c=.95 的正态 (0,1) 分布的逆 CDF，即：

-sqrt(2)*InverseErf(-c) = 1.959963984540054235524594430520551527955550...

参见 http://mathworld.wolfram.com/InverseErf.html 用于反 erf() 函数。
请注意，在至少有 474 个数据点之前，g((1-.95)/2,n-1) 不会舍入为 1.96。当有 29 个数据点时，四舍五入为 2.0。

根据经验，您应该使用 Student-t 而不是 n 的正常近似值，最多至少为 300，而不是按照传统观点为 30。比照。 http://www.johndcook.com/blog/2008/11/12/< /a>.

另请参阅康奈尔大学 Ping Li 的“改进压缩计数”。

[Huge thanks to John D Cook for much of what I learned in putting together this answer!]

First, here's the reason not to use sum-of-squares: http://www.johndcook.com/blog/2008/09/26/

What you should do instead:

Keep track of the count (n), the mean (u), and a quantity (s) from which sample variance and standard error can be determined. (Adapted from http://www.johndcook.com/standard_deviation.html.)

Initialize n = u = s = 0.

For each new datapoint, x:

u0 = u;
n ++;
u += (x - u) / n;
s += (x - u0) * (x - u);

The sample variance is then s/(n-1), the variance of the sample mean is s/(n-1)/n, and the standard error of the sample mean is SE = sqrt(s/(n-1)/n).

It remains to compute the Student-t c-confidence interval (c in (0,1)):

u [plus or minus] SE*g((1-c)/2, n-1)

where g is the inverse cdf (aka quantile) of the Student-t distribution with mean 0 and variance 1, taking a probability and the degrees of freedom (one less than the number of data points):

g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1/2) - 1)

where irib is the inverse regularized incomplete beta function:

irib(s0,s1,a,b) = z such that rib(s0,z,a,b) = s1

where rib is the regularized incomplete beta function:

rib(x0,x1,a,b) = B(x0,x1,a,b) / B(a,b)

where B(a,b) is the Euler beta function and B(x0,x1,a,b) is the incomplete beta function:

B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) = integral_0^1 t^(a-1)*(1-t)^(b-1) dt
B(x0,x1,a,b) = integral_x0^x1 t^(a-1)*(1-t)^(b-1) dt

Typical numerical/statistics libraries will have implementations of the beta function (or the inverse cdf of the Student-t distribution directly). For C, the de facto standard is the Gnu Scientific Library (GSL). Often a 3-argument version of the beta function is given; the generalization to 4 arguments is as follows:

B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b)
rib(x0,x1,a,b) = rib(x1,a,b) - rib(x0,a,b)

Finally, here is an implementation in Mathematica:

(* Take current {n,u,s} and new data point; return new {n,u,s}. *)
update[{n_,u_,s_}, x_] := {n+1, u+(x-u)/(n+1), s+(x-u)(x-(u+(x-u)/(n+1)))}

Needs["HypothesisTesting`"];
g[p_, df_] := InverseCDF[StudentTDistribution[df], p]

(* Mean CI given n,u,s and confidence level c. *)
mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]}, 
  {u+d, u-d}]

Compare to

StudentTCI[u, SE, n-1, ConfidenceLevel->c]

or, when the entire list of data points is available,

MeanCI[list, ConfidenceLevel->c]

Finally, if you don't want to load math libraries for things like the beta function, you can hardcode a lookup table for -g((1-c)/2, n-1). Here it is for c=.95 and n=2..100:

12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934,
2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168,
2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226,
2.1603686564627917, 2.1447866879178012, 2.131449545559774, 2.1199052992212533,
2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626,
2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254,
2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243,
2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976,
2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462,
2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715,
2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627,
2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314,
2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857,
2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696,
2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578,
1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692,
1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386,
1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.993463566661884,
1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793,
1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452,
1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192,
1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672,
1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985,
1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508,
1.9847231860139618, 1.98446745450849, 1.9842169515863888

which is asymptotically approaching the inverse CDF of a normal(0,1) distribution for c=.95, which is:

-sqrt(2)*InverseErf(-c) = 1.959963984540054235524594430520551527955550...

See http://mathworld.wolfram.com/InverseErf.html for the inverse erf() function.
Notice that g((1-.95)/2,n-1) doesn't round to 1.96 until there are at least 474 data points. It rounds to 2.0 when there are 29 data points.

As a rule of thumb, you should use Student-t instead of the normal approximation for n up to at least 300, not 30 per conventional wisdom. Cf. http://www.johndcook.com/blog/2008/11/12/.

See also "Improving Compressed Counting" by Ping Li of Cornell.

回复收藏 0 原文

飘然心甜 2024-07-15 09:59:35

   sigma = sqrt( (q - (s*s/n)) / (n-1) )
   delta = t(1-c/2,n-1) * sigma / sqrt(n)

其中 t(x, n-1) 是自由度为 n-1 的 t 分布。
如果您使用 gsl

t = gsl_cdf_tdist_Qinv (c/2.0, n-1)

则无需存储超出平方和的任何数据。现在，您可能会遇到数字问题，因为平方和可能非常大。您可以使用 s 的替代定义

sigma = sqrt(sum(( x_i - s/n )^2 / (n-1)))

并进行两次传递。我鼓励您考虑使用 gnu 科学库或像 R 这样的包可以帮助您避免数字问题。另外，请小心使用中心极限定理。滥用它是目前正在发生的整个金融危机的部分原因。

   sigma = sqrt( (q - (s*s/n)) / (n-1) )
   delta = t(1-c/2,n-1) * sigma / sqrt(n)

Where t(x, n-1) is the t- distribution with n-1 degrees of freedom.
if you are using gsl

t = gsl_cdf_tdist_Qinv (c/2.0, n-1)

There's no need to store any data beyond the sum of squares. Now, you might have a numerical issue because the sum-of-squares can be quite large. You could use the alternate definition of s

sigma = sqrt(sum(( x_i - s/n )^2 / (n-1)))

and make two passes. I would encourage you to consider using gnu scientific library or a package like R to help you avoid numerical issues. Also, be careful about your use of the central limit theorem. Abuse of it is partially to blame for the whole financial crisis going on right now.

回复收藏 0 原文