INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...], sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn't matter for this problem.
PROBLEM: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.
I don't know if I am right, but to determine probabilities I think I need to fit my data to the theoretical distribution that is most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.
Is there a way to implement such an analysis in Python (Scipy or Numpy)?
Could you present any examples?
Distribution Fitting with Sum of Square Error (SSE)
This is an update and modification to Saullo's answer that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.
Example Fitting
Using the El Niño dataset from statsmodels, the distributions are fit and the error is determined. The distribution with the least error is returned.
All Distributions
Best Fit Distribution
Example Code
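The code from this answer is not reproduced here, so what follows is only a minimal sketch of the SSE idea described above, not the author's implementation. The helper name `best_fit_sse`, the short candidate list, and the synthetic normal sample (used in place of the El Niño data) are all assumptions made for illustration.

```python
import warnings

import numpy as np
from scipy import stats


def best_fit_sse(data, candidates, bins=50):
    """Return (name, params, sse) for the candidate with the smallest SSE.

    Hypothetical helper: compares each fitted PDF with the density
    histogram of the data, evaluated at the bin centres.
    """
    y, edges = np.histogram(data, bins=bins, density=True)
    x = (edges[:-1] + edges[1:]) / 2.0

    results = []
    for name in candidates:
        dist = getattr(stats, name)
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                params = dist.fit(data)        # MLE: shape(s), loc, scale
                pdf = dist.pdf(x, *params)
            sse = float(np.sum((y - pdf) ** 2))
        except Exception:
            continue                           # skip candidates that fail to fit
        if np.isfinite(sse):
            results.append((name, params, sse))
    return min(results, key=lambda r: r[2])


# Synthetic stand-in data (assumption), not the El Niño dataset
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=5000)

name, params, sse = best_fit_sse(sample, ["norm", "expon", "uniform"])
print(name, params, sse)
```

Looping over every scipy.stats distribution instead of three candidates works the same way but is much slower, and the SSE depends on the number of bins chosen.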
There are more than 90 implemented distribution functions in SciPy v1.6.0. You can test how some of them fit your data using their fit() method. Check the code below for more details:
References:
Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?
Distribution fitting with Scipy
And here is a list with the names of all distribution functions available in Scipy 0.12.0 (VI):
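The names for whichever SciPy version is installed can also be collected programmatically, which avoids relying on a list tied to one release. This is a small sketch that assumes the distribution objects exposed in the `scipy.stats` namespace are instances of `rv_continuous` or `rv_discrete`; the variable names are illustrative.

```python
from scipy import stats

# Distribution objects such as stats.norm are instances of rv_continuous
continuous_names = sorted(
    name for name, obj in vars(stats).items()
    if isinstance(obj, stats.rv_continuous)
)

# Discrete distributions such as stats.poisson are instances of rv_discrete
discrete_names = sorted(
    name for name, obj in vars(stats).items()
    if isinstance(obj, stats.rv_discrete)
)

print(len(continuous_names), "continuous;", len(discrete_names), "discrete")
```

The counts will differ between SciPy versions.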
You can try the distfit library. If you have more questions, let me know; I am also the developer of this open-source library.
Note that in this case, all points will be significant because of the uniform distribution. You can filter with dist.y_pred if required.
More detailed information and examples can be found in the documentation pages.
The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:
1, the one that gives you the highest log likelihood.
2, the one that gives you the smallest AIC, BIC or BICc values (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; these can basically be viewed as the log likelihood adjusted for the number of parameters, as distributions with more parameters are expected to fit better).
3, the one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).
Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution.
scipy does not come with a function to calculate the log likelihood (although an MLE method is provided), but hard-coding one is easy: see Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?
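As a rough illustration of the log likelihood and AIC/BIC comparison described in the fit() answer, the sketch below sums `logpdf` over the data at the fitted parameters. The helper name `information_criteria`, the candidate list, and the synthetic gamma sample are assumptions; every fitted parameter (including loc and scale) is counted as free.

```python
import numpy as np
from scipy import stats


def information_criteria(data, dist):
    """Return (log likelihood, AIC, BIC) for one scipy.stats distribution.

    Hypothetical helper; k counts all parameters returned by fit().
    """
    params = dist.fit(data)                       # maximum likelihood estimates
    loglik = float(np.sum(dist.logpdf(data, *params)))
    k, n = len(params), len(data)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return loglik, aic, bic


# Synthetic stand-in data (assumption)
rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=3.0, size=2000)

scores = {
    name: information_criteria(sample, getattr(stats, name))
    for name in ("gamma", "norm", "expon")
}
best_by_aic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, scores[best_by_aic])
```

Lower AIC or BIC is better; BIC penalises extra parameters more strongly once the sample has more than a handful of points.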
AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of the different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:
Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):
Please note that the ccdf is closely related to the survival function (sf), but it's also defined for discrete distributions, whereas sf is defined only for continuous distributions.
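The counting approach described in this answer can be sketched with numpy as follows. The synthetic uniform integers stand in for the real 30,000-value list, and the variable names are illustrative, not taken from the original answer.

```python
import numpy as np

# Synthetic stand-in for the real list of integers in 0..47 (assumption)
rng = np.random.default_rng(42)
values = rng.integers(0, 48, size=30_000)

counts = np.bincount(values, minlength=48)   # occurrences of each value 0..47
pmf = counts / counts.sum()                  # normalized frequencies
ccdf = 1.0 - np.cumsum(pmf)                  # ccdf[k] = P(X > k)

print(ccdf[1])                               # probability of a value higher than 1
```

Indexing `ccdf` with any integer from 0 to 47 gives the empirical probability of seeing a strictly greater value.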
The following code is a version of the general answer, but with corrections and improved clarity.
If you want to do a more detailed analysis, I recommend the Phitter library.
While many of the above answers are completely valid, no one seems to answer your question completely, specifically the part:
The parametric approach
This is the process you're describing of using some theoretical distribution and fitting the parameters to your data, and there are some excellent answers on how to do this.
The non-parametric approach
However, it's also possible to use a non-parametric approach to your problem, which means you do not assume any underlying distribution at all.
This uses the so-called empirical distribution function, which equals:
Fn(x) = SUM( I[X_i <= x] ) / n, i.e. the proportion of values less than or equal to x.
As was pointed out in one of the above answers, what you're interested in is the complementary CDF (cumulative distribution function), which is equal to 1-F(x).
It can be shown that the empirical distribution function will converge to whatever 'true' CDF generated your data.
Furthermore, it is straightforward to construct a 1-alpha confidence interval by:
Then P( L(x) <= F(x) <= U(x) ) >= 1-alpha for all x.
I'm quite surprised that PierrOz's answer has 0 votes, while it's a completely valid answer to the question using a non-parametric approach to estimating F(x).
Note that the issue you mention of P(X>=x)=0 for any x>47 is simply a personal preference that might lead you to choose the parametric approach over the non-parametric approach. Both approaches, however, are completely valid solutions to your problem.
For more details and proofs of the above statements I would recommend having a look at
'All of Statistics: A Concise Course in Statistical Inference by Larry A. Wasserman'. An excellent book on both parametric and non-parametric inference.
EDIT:
Since you specifically asked for some python examples, it can be done using numpy:
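A minimal numpy sketch of the empirical distribution function follows. The formula for the band did not survive in this copy, so the code assumes the interval meant is the Dvoretzky-Kiefer-Wolfowitz band, with half-width sqrt(log(2/alpha) / (2n)) and clipping to [0, 1]. The function name `ecdf_with_band` and the synthetic sample are also assumptions.

```python
import numpy as np


def ecdf_with_band(sample, x, alpha=0.05):
    """Empirical CDF at x with an assumed DKW (1 - alpha) confidence band."""
    sample = np.sort(np.asarray(sample))
    n = sample.size
    # F_n(x): fraction of observations less than or equal to x
    fn = np.searchsorted(sample, x, side="right") / n
    # Assumed DKW half-width; the band holds simultaneously for all x
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(fn - eps, 0.0, 1.0)
    upper = np.clip(fn + eps, 0.0, 1.0)
    return fn, lower, upper


# Synthetic stand-in for the real integer data (assumption)
rng = np.random.default_rng(7)
sample = rng.integers(0, 48, size=30_000)

x = np.arange(48)
fn, lower, upper = ecdf_with_band(sample, x)
p_greater = 1.0 - fn            # estimate of P(X > x) for each x in 0..47
print(p_greater[:5])
```

With about 30,000 observations the band is narrow, so the empirical estimate of 1-F(x) is already quite tight.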
The easiest way I found was to use the fitter module, which you can install with pip install fitter.
All you have to do is import the dataset with pandas.
It has a built-in function to search all 80 distributions from scipy and get the best fit to the data by various methods. Example:
Here the author has provided a list of distributions, since scanning all 80 can be time consuming.
This will get you the 5 best distributions with their fit criteria:
You can also pass distributions=get_common_distributions(), which covers about 10 of the most commonly used distributions, and it fits and checks them for you.
It also has a bunch of other functions, like plotting histograms, and complete documentation can be found here.
It is a seriously underrated module for scientists, engineers, and in general.
It sounds like a probability density estimation problem to me.
Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.
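One way to act on the density estimation suggestion is a Gaussian kernel density estimate with `scipy.stats.gaussian_kde`, integrating the estimated density above a threshold. This is only a sketch: the synthetic normal sample and the helper name `p_greater` are assumptions, and a KDE smooths integer data, so results near 0 and 47 should be read with care.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(3)
sample = rng.normal(loc=20.0, scale=5.0, size=5000)

kde = gaussian_kde(sample)      # bandwidth chosen by the default rule


def p_greater(x):
    """Estimated probability of a value greater than x under the KDE."""
    return kde.integrate_box_1d(x, np.inf)


print(p_greater(25.0))
```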
What about storing your data in a dictionary where the keys are the numbers between 0 and 47 and the values are the numbers of occurrences of those keys in your original list?
Then your probability p(x) is the sum of the values for all keys greater than x, divided by the total number of values (about 30,000).
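The dictionary idea can be sketched with `collections.Counter` from the standard library. The randomly generated list stands in for the real data, and `p_greater` is an illustrative name.

```python
import random
from collections import Counter

# Synthetic stand-in for the real list of integers in 0..47 (assumption)
random.seed(0)
values = [random.randint(0, 47) for _ in range(30_000)]

counts = Counter(values)            # key: value, count: number of occurrences
total = sum(counts.values())


def p_greater(x):
    """Fraction of observations strictly greater than x."""
    return sum(c for key, c in counts.items() if key > x) / total


print(p_greater(10))
```

Dividing by `total` rather than a hard-coded 30000 keeps the result correct when the list length is not exactly 30,000.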
With OpenTURNS, I would use the BIC criterion to select the best distribution that fits such data. This is because this criterion does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov test may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.
To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:
It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.
This prints:
In order to graphically compare the fit to the histogram, I use the drawPDF method of the best distribution.
This produces:
More details on this topic are presented in the BestModelBIC doc. It would be possible to include the Scipy distributions with SciPyDistribution, or even ChaosPy distributions with ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.
I redesigned the distribution function from the first answer, adding a selection parameter for choosing one of the goodness-of-fit tests, which narrows down the distribution functions that fit the data:
Then continue to the make_pdf function to get the selected distribution based on your goodness-of-fit test(s).
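The redesigned function itself is not shown here, so the following is only a hedged sketch of what ranking by a goodness-of-fit test could look like, using the Kolmogorov-Smirnov test from `scipy.stats.kstest`. The helper name `rank_by_ks`, the candidate list, and the synthetic sample are assumptions. Since the parameters are estimated from the same data, the p-values are optimistic and are better used for ranking than as strict tests.

```python
import numpy as np
from scipy import stats


def rank_by_ks(data, candidates):
    """Fit each candidate and sort by the KS statistic (smaller is better)."""
    ranked = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)                          # MLE parameters
        result = stats.kstest(data, name, args=params)   # compare data with fitted CDF
        ranked.append((name, result.statistic, result.pvalue))
    return sorted(ranked, key=lambda r: r[1])


# Synthetic stand-in data (assumption)
rng = np.random.default_rng(5)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)

ranked = rank_by_ks(sample, ["norm", "expon", "uniform"])
for name, statistic, p_value in ranked:
    print(name, round(statistic, 4), p_value)
```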
Based on Timothy Davenport's answer, I have refactored the code to be usable as a library and made it available as a github and pypi project, see:
One goal is to make the density option available and to output the results as files. See the main part of the implementation:
There are also unit tests for the library, see e.g.
Normal distribution test
Please add issues to the project if you see problems or room for improvement. Discussions are also enabled.
The code below might not be up to date; please use pypi or the github repository for the most current version.