在 python scipy 中实现 Kolmogorov Smirnov 测试

发布于 2024-12-11 18:21:36 字数 763 浏览 5 评论 0原文

我有一个关于 N 个数字的数据集，我想测试其正态性。我知道 scipy.stats 有一个 kstest 函数但没有关于如何使用它以及如何解释结果的示例。这里有熟悉的人可以给我一些建议吗？

根据文档，使用 kstest 返回两个数字：KS 检验统计量 D 和 p 值。如果 p 值大于显着性水平（例如 5%），那么我们不能拒绝数据来自给定分布的假设。

当我通过从正态分布中抽取 10000 个样本并测试高斯性来进行测试运行时：

import numpy as np
from scipy.stats import kstest

mu,sigma = 0.07, 0.89
kstest(np.random.normal(mu,sigma,10000),'norm')

我得到以下输出：

(0.04957880905196102, 8.9249710700788814e-22)

p 值小于 5%，这意味着我们可以拒绝数据呈正态分布的假设。但样本是从正态分布中抽取的！

有人可以理解并向我解释这里的差异吗？

（正态性测试是否假设 mu = 0 且 sigma = 1？如果是这样，我如何测试我的数据是否呈高斯分布但具有不同的 mu 和 sigma？）

原文

I have a data set on N numbers that I want to test for normality.
I know scipy.stats has a kstest function
but there are no examples on how to use it and how to interpret the results.
Is anyone here familiar with it that can give me some advice?

According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value.
If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.

When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:

import numpy as np
from scipy.stats import kstest

mu,sigma = 0.07, 0.89
kstest(np.random.normal(mu,sigma,10000),'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)

The p-value is less than 5% which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone understand and explain to me the discrepancy here?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are gaussianly distributed but with a different mu and sigma?)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

末蓝 2024-12-18 18:21:36

您的数据是使用 mu=0.07 和 sigma=0.89 生成的。
您正在对照平均值为 0、标准差为 1 的正态分布来测试此数据。

零假设 (H0) 是指您的数据作为样本的分布等于标准正态分布，其中均值 0，标准偏差 1。

较小的 p 值表明，在概率 p 值的情况下，预计会出现与 D 一样大的检验统计量。

换句话说，（p 值约为 8.9e-22）H0 为真的可能性极小。

这是合理的，因为均值和标准偏差不匹配。

将您的结果与以下内容进行比较：

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)

要测试您的数据是否为高斯分布，您可以对其进行移动和重新调整，使其在均值 0 和标准偏差 1 的情况下正常：

data=np.random.normal(mu,sigma,10000)
normed_data=(data-mu)/sigma
print(stats.kstest(normed_data,'norm'))
# (0.0085805670733036798, 0.45316245879609179)

警告： (非常感谢 user333700 （又名 scipy 开发人员 Josef Perktold)) 如果您不知道 mu 和 sigma，估计参数使 p 值无效：

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in xrange(num_tests):
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))

打印

20/100000 = 0.00 rejects at rejection level 0.05 (not expected)

显示 stats.kstest 可能不会拒绝预期数量的零假设
如果使用样本的平均值和标准差对样本进行归一化

normed_data = (data - data.mean()) / data.std()    # this is NOT okay

Your data was generated with mu=0.07 and sigma=0.89.
You are testing this data against a normal distribution with mean 0 and standard deviation of 1.

The null hypothesis (H0) is that the distribution of which your data is a sample is equal to the standard normal distribution with mean 0, std deviation 1.

The small p-value is indicating that a test statistic as large as D would be expected with probability p-value.

In other words, (with p-value ~8.9e-22) it is highly unlikely that H0 is true.

That is reasonable, since the means and std deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)

To test your data is gaussian, you could shift and rescale it so it is normal with mean 0 and std deviation 1:

data=np.random.normal(mu,sigma,10000)
normed_data=(data-mu)/sigma
print(stats.kstest(normed_data,'norm'))
# (0.0085805670733036798, 0.45316245879609179)

Warning: (many thanks to user333700 (aka scipy developer Josef Perktold)) If you don't know mu and sigma, estimating the parameters makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in xrange(num_tests):
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))

prints

20/100000 = 0.00 rejects at rejection level 0.05 (not expected)

which shows that stats.kstest may not reject the expected number of null hypotheses
if the sample is normalized using the sample's mean and standard deviation

normed_data = (data - data.mean()) / data.std()    # this is NOT okay

回复收藏 0 原文

在巴黎塔顶看东京樱花 2024-12-18 18:21:36

unutbu答案的更新：

对于仅依赖于位置和尺度但没有形状参数的分布，几个拟合优度检验统计量的分布与位置和尺度值无关。该分布是非标准的，但是，它可以制成表格并与基础分布的任何位置和规模一起使用。

具有估计位置和规模的正态分布的 Kolmogorov-Smirnov 检验也称为 Lilliefors 检验。

它现在可在 statsmodels 中使用，并具有相关决策范围的近似 p 值。

>>> import numpy as np
>>> mu,sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

大多数蒙特卡洛研究表明，Anderson-Darling 检验比 Kolmogorov-Smirnov 检验更有效。它可在具有临界值的 scipy.stats 和具有近似 p 值的 statsmodels 中使用：

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)

这两个检验都没有拒绝样本呈正态分布的原假设。
而问题中的 kstest 拒绝样本呈标准正态分布的原假设。

An update on unutbu's answer:

For distributions that only depend on the location and scale but do not have a shape parameter, the distributions of several goodness-of-fit test statistics are independent of the location and scale values. The distribution is non-standard, however, it can be tabulated and used with any location and scale of the underlying distribution.

The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.

It is now available in statsmodels, with approximate p-values for the relevant decision range.

>>> import numpy as np
>>> mu,sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)

Neither of the test rejects the Null hypothesis that the sample is normal distributed.
While the kstest in the question rejects the Null hypothesis that the sample is standard normal distributed.

回复收藏 0 原文

牵你的手，一向走下去 2024-12-18 18:21:36

您可能还需要考虑使用 Shapiro-Wilk 检验，该检验“检验数据是从正态分布中提取的零假设”。它也在 scipy 中实现：

http://docs.scipy.org/doc/scipy/reference/ generated/scipy.stats.shapiro.html

您需要将数据直接传递到函数中。

import scipy

W, p = scipy.stats.shapiro(dataset)
print("Shapiro-Wilk test statistic, W:", W, "\n", "p-value:", p)

它返回类似：

 Shapiro-Wilk test statistic, W: 0.7761164903640747 
 p-value: 6.317247641091492e-37

With p << 0.01（或 0.05，如果您愿意的话，这并不重要），我们有充分的理由拒绝这些数据是从正态分布中提取的零假设。

You may also want to consider using the Shapiro-Wilk test, which "tests the null hypothesis that the data was drawn from a normal distribution." It's also implemented in scipy:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

You'll need to pass your data directly into the function.

import scipy

W, p = scipy.stats.shapiro(dataset)
print("Shapiro-Wilk test statistic, W:", W, "\n", "p-value:", p)

Which returns something like:

 Shapiro-Wilk test statistic, W: 0.7761164903640747 
 p-value: 6.317247641091492e-37

With p << 0.01 (or 0.05, if you prefer - it doesn't matter,) we have good reason to reject the null hypothesis that these data were drawn from a normal distribution.

回复收藏 0 原文

花开柳相依 2024-12-18 18:21:36

作为对 @unutbu 答案的补充，您还可以提供 kstest 中测试分布的分布参数。假设我们有一些来自变量的样本（并将它们命名为 datax），并且我们想要检查这些样本是否可能不是来自对数正态分布、均匀分布或正态分布。请注意，对于 scipy stats，每个分布采用输入参数的方式略有不同。现在，由于 kstest 中的“args”（元组或序列），可以为您想要测试的 scipy.stats 分布提供参数。

:) 我还添加了使用两个样本测试的选项，以防您想以任何一种方式进行：

import numpy as np
from math import sqrt
from scipy.stats import kstest, ks_2samp, lognorm
import scipy.stats

def KSSeveralDists(data,dists_and_args,samplesFromDists=100,twosampleKS=True):
    returnable={}
    for dist in dists_and_args:
        try:
            if twosampleKS:
                try:
                    loc=dists_and_args[dist][0]
                    scale=dists_and_args[dist][1]
                    expression='scipy.stats.'+dist+'.rvs(loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                except:
                    sc=dists_and_args[dist][0]
                    loc=dists_and_args[dist][1]
                    scale=dists_and_args[dist][2]
                    expression='scipy.stats.'+dist+'.rvs(sc,loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                D,p=ks_2samp(data,sampledDist)
            else:
                D,p=kstest(data,dist,N=samplesFromDists,args=dists_and_args[dist])
        except:
            continue
        returnable[dist]={'KS':D,'p-value':p}
    return returnable

a=lambda m,std: m-std*sqrt(12.)/2.
b=lambda m,std: m+std*sqrt(12.)/2.
sz=2000

sc=0.5 #shape 
datax=lognorm.rvs(sc,loc=0.,scale=1.,size=sz)
normalargs=(datax.mean(),datax.std())

#suppose these are the parameters you wanted to pass for each distribution
dists_and_args={'norm':normalargs,
               'uniform':(a(*normalargs),b(*normalargs)),
               'lognorm':[0.5,0.,1.]
              }
print "two sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=True)
print "one sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=False)

它给出的输出如下：

两个样本 KS：
{'lognorm'：{'KS'：0.023499999999999965，'p值'：0.63384188886455217}，'范数'：{'KS'：0.10600000000000004，'p值'： 2.918766666723155e-10}, 'uniform': {'KS': 0.15300000000000002, 'p-value': 6.443660021191129e-21}}

一个样本 KS:
{'lognorm'：{'KS'：0.01763415915126032，'p值'：0.56275820961065193}，'范数'：{'KS'：0.10792612430093562，'p值'：0.0}，'均匀'： {'KS': 0.14910036159697559, 'p-value': 0.0}}

注意：对于 scipy.stats 均匀分布，a 和 b 取为 a=loc 和 b=loc + scale（参见文档）。

As a complement to the answer by @unutbu , you could also provide the distribution parameters for the test distribution in kstest. Suppose that we had some samples from a variable (and named them datax), and we wanted to check if those samples could possibly not come from a lognormal, a uniform, or a normal. Note that for scipy stats the way the input parameters are taken for each distribution varies a bit. Now, thanks to "args" (tuple or sequence) in kstest, is possible provide the arguments for the scipy.stats distribution you want to test against.

:) I also added the option of using a two-sample test, in case you wanted to do it either way:

import numpy as np
from math import sqrt
from scipy.stats import kstest, ks_2samp, lognorm
import scipy.stats

def KSSeveralDists(data,dists_and_args,samplesFromDists=100,twosampleKS=True):
    returnable={}
    for dist in dists_and_args:
        try:
            if twosampleKS:
                try:
                    loc=dists_and_args[dist][0]
                    scale=dists_and_args[dist][1]
                    expression='scipy.stats.'+dist+'.rvs(loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                except:
                    sc=dists_and_args[dist][0]
                    loc=dists_and_args[dist][1]
                    scale=dists_and_args[dist][2]
                    expression='scipy.stats.'+dist+'.rvs(sc,loc=loc,scale=scale,size=samplesFromDists)'
                    sampledDist=eval(expression)
                D,p=ks_2samp(data,sampledDist)
            else:
                D,p=kstest(data,dist,N=samplesFromDists,args=dists_and_args[dist])
        except:
            continue
        returnable[dist]={'KS':D,'p-value':p}
    return returnable

a=lambda m,std: m-std*sqrt(12.)/2.
b=lambda m,std: m+std*sqrt(12.)/2.
sz=2000

sc=0.5 #shape 
datax=lognorm.rvs(sc,loc=0.,scale=1.,size=sz)
normalargs=(datax.mean(),datax.std())

#suppose these are the parameters you wanted to pass for each distribution
dists_and_args={'norm':normalargs,
               'uniform':(a(*normalargs),b(*normalargs)),
               'lognorm':[0.5,0.,1.]
              }
print "two sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=True)
print "one sample KS:"
print KSSeveralDists(datax,dists_and_args,samplesFromDists=sz,twosampleKS=False)

which gives as an output something like:

two sample KS:
{'lognorm': {'KS': 0.023499999999999965, 'p-value': 0.63384188886455217}, 'norm': {'KS': 0.10600000000000004, 'p-value': 2.918766666723155e-10}, 'uniform': {'KS': 0.15300000000000002, 'p-value': 6.443660021191129e-21}}

one sample KS:
{'lognorm': {'KS': 0.01763415915126032, 'p-value': 0.56275820961065193}, 'norm': {'KS': 0.10792612430093562, 'p-value': 0.0}, 'uniform': {'KS': 0.14910036159697559, 'p-value': 0.0}}

Note: For the scipy.stats uniform distribution, a and b are taken as a=loc and b=loc + scale (see documentation).

回复收藏 0 原文

~没有更多了~