Calculating confidence intervals for a non-normal distribution
First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I use the mean ± 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently 1000, which seems like enough to determine whether the data is normally distributed.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining whether my data is normally distributed? I can currently tell just by looking at the histogram or the pdf from ksdensity, but is there a way to measure it quantitatively?
Thanks!
Comments (5)
So there are a couple of questions there. Here are some suggestions.
You are right that the mean of 1000 samples should be approximately normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a (1-alpha) confidence interval for the mean (in your case alpha = 0.05), you can use the 'norminv' function. For example, for a 95% CI for the mean of a data sample X, take mean(X) plus and minus norminv(1 - alpha/2) times the standard error of the mean, std(X)/sqrt(N).
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is a QQ plot. To do this, use 'qqplot(X)', where X is your data sample. If the result is approximately a straight line, the sample is consistent with a normal distribution. If the result is not a straight line, the sample is not normal. For example, if X = exprnd(3,1000,1), the sample is non-normal and the qqplot is very non-linear. On the other hand, if the data is normal, the qqplot will give a straight line.
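The lines below are a minimal sketch of the norminv approach described in this answer, assuming the Statistics and Machine Learning Toolbox is available. The exponential test data, the fixed seed, and all variable names are illustrative assumptions, not the answer's original code.

```matlab
% Sketch: normal-approximation CI for the mean of a non-normal sample.
rng(0);                       % fixed seed so the run is repeatable (assumption)
N = 1000;                     % sample size from the question
X = exprnd(3, N, 1);          % illustrative non-normal data (exponential, mean 3)

alpha = 0.05;                 % 1 - alpha = 95% confidence
mu = mean(X);                 % sample mean
se = std(X) / sqrt(N);        % standard error of the mean
z  = norminv(1 - alpha/2);    % standard normal quantile, about 1.96
CI = [mu - z*se, mu + z*se];  % interval for the MEAN, not for individual samples
fprintf('mean = %.3f, 95%% CI = [%.3f, %.3f]\n', mu, CI(1), CI(2));

% Visual normality check: compare a non-normal and a normal sample.
figure; qqplot(X);            % curved: exponential data is not normal
figure; qqplot(randn(N, 1));  % close to a straight line
```

Note that this interval describes the uncertainty of the mean. It is much narrower than the range that contains 95% of the individual samples.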
You might also consider bootstrapping, using the bootci function.
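As a rough sketch of what that could look like, assuming the Statistics and Machine Learning Toolbox: bootci resamples the data with replacement and builds an interval from the resampled statistics, so it does not require normality. The test data, the seed, and the number of resamples (2000) are arbitrary choices for illustration.

```matlab
% Sketch: bootstrap confidence intervals with bootci.
rng(0);                               % repeatable resampling (assumption)
X = exprnd(3, 1000, 1);               % illustrative non-normal data

nboot = 2000;                         % number of bootstrap resamples (arbitrary)
ciMean   = bootci(nboot, @mean, X);   % 95% CI for the mean (default level)
ciMedian = bootci(nboot, @median, X); % the same call works for other statistics

fprintf('mean:   [%.3f, %.3f]\n', ciMean(1), ciMean(2));
fprintf('median: [%.3f, %.3f]\n', ciMedian(1), ciMedian(2));
```

By default bootci returns a 95% interval as a two-element column, lower bound first. Other levels can be requested with the 'alpha' name-value option.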
You may use the method proposed in [1]:
Median ± 1.7 * (1.25 * R / (1.35 * SQN))
where R = interquartile range,
SQN = square root of N.
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are significantly different at approximately the 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
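A short sketch of the notch interval from [1], written out in MATLAB. The constants 1.7, 1.25 and 1.35 are those of the McGill, Tukey and Larsen notch formula; the test data and variable names are assumptions for illustration, and iqr is assumed to be available (Statistics and Machine Learning Toolbox).

```matlab
% Sketch: notch interval around the median, M +/- 1.7*(1.25*R/(1.35*sqrt(N))).
rng(0);                            % repeatable data (assumption)
X = exprnd(3, 1000, 1);            % illustrative non-normal data

N = numel(X);                      % sample size
M = median(X);                     % sample median
R = iqr(X);                        % interquartile range
halfWidth = 1.7 * (1.25 * R / (1.35 * sqrt(N)));  % about 1.57*R/sqrt(N)
notch = [M - halfWidth, M + halfWidth];

fprintf('median = %.3f, notch = [%.3f, %.3f]\n', M, notch(1), notch(2));
```

The same limits are drawn when boxplot is called with the 'Notch' option set to 'on'.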
Are you sure you need confidence intervals, or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector x holding independent, identically distributed samples of a random variable, you can get some useful information by running y = prctile(x, [5 50 95]).
This returns in [y(1), y(3)] the range in which 90% of your samples occur, and in y(2) the median of the sample.
You can try this with a normally distributed variable, for instance one generated with randn.
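A minimal sketch of such a trial, assuming prctile from the Statistics and Machine Learning Toolbox. The sample size of 10000 and the seed are arbitrary illustrative choices.

```matlab
% Sketch: empirical 90% range and median of a sample with prctile.
rng(0);                          % repeatable data (assumption)
x = randn(10000, 1);             % standard normal test sample

y = prctile(x, [5 50 95]);       % 5th, 50th and 95th percentiles
inside = mean(x >= y(1) & x <= y(3));  % fraction of samples inside [y(1), y(3)]

fprintf('90%% range = [%.3f, %.3f], median = %.3f\n', y(1), y(3), y(2));
fprintf('fraction inside the range = %.3f\n', inside);
```

For a standard normal sample the two limits should come out near -1.645 and 1.645. The call makes no distributional assumption, so it applies unchanged to non-normal data.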
I have not used Matlab, but from my understanding of statistics, if your distribution cannot be assumed to be normal, you have to use the Student t distribution to calculate the confidence interval and accuracy.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
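For comparison, here is a sketch of a t-based interval for the mean in MATLAB, assuming tinv and norminv from the Statistics and Machine Learning Toolbox; the test data and names are illustrative. The t interval accounts for the standard deviation being estimated from the sample, and it still relies on the sample mean being approximately normal.

```matlab
% Sketch: Student t confidence interval for the mean, next to the normal one.
rng(0);                               % repeatable data (assumption)
X = exprnd(3, 1000, 1);               % illustrative non-normal data

N = numel(X);
alpha = 0.05;                         % 95% confidence
se = std(X) / sqrt(N);                % standard error of the mean
tcrit = tinv(1 - alpha/2, N - 1);     % t quantile with N-1 degrees of freedom
zcrit = norminv(1 - alpha/2);         % normal quantile for comparison

ciT = mean(X) + [-1, 1] * tcrit * se; % t-based interval
ciZ = mean(X) + [-1, 1] * zcrit * se; % normal-based interval

fprintf('t: [%.4f, %.4f]   z: [%.4f, %.4f]\n', ciT(1), ciT(2), ciZ(1), ciZ(2));
```

With N = 1000 the two critical values differ only in the third decimal place, so the two intervals are nearly identical.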