How do I fit a distribution defined with scipy.stats.rv_continuous?

Posted on 2025-01-16 13:22:43

I would like to fit data with a combination of distributions in Python, and the most logical way seems to be via scipy.stats.rv_continuous. I was able to define a new distribution using this class and to fit some artificial data. However, the fit returns two more values than the distribution has free parameters, and I don't understand how to interpret them. In addition, the fit is very slow, so any suggestion on how to speed it up would be highly appreciated.

Here is a minimal reproducible example (for the sake of this question I will use the combination of a normal and a lognormal distribution):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt  # needed for the plotting code further down

# Create the new distribution combining a normal and lognormal distr
def lognorm(x,s,loc,scale):
    return(stats.lognorm.pdf(x, s = s, loc = loc, scale = scale))
def norm(x,loc,scale):
    return(stats.norm.pdf(x, loc = loc, scale = scale))

class combo_dist_gen(stats.rv_continuous):
    "Gaussian and lognormal combination"
    def _pdf(self, x,  s1, loc1, scale1, loc2, scale2):
        return (lognorm(x, s1, loc1, scale1) + norm(x, loc2, scale2))

combo_dist = combo_dist_gen(name='combo_dist')

# Generate some artificial data
gen_data = np.append(stats.norm.rvs(loc=0.2, scale=0.1, size=5000),\
    stats.lognorm.rvs(size=5000, s=0.1, loc=0.2, scale=0.5))

# Fit the data with the new distribution
# I provide initial values not too far from the original distribution
Fit_results = combo_dist.fit(gen_data, 0.15, 0.15, 0.6, 0.25, 0.05)

Apart from being very slow, the fit seems to work. However, it returns 7 values, while the distribution has only 5 free parameters:

print(Fit_results)
(0.0608036989522803, 0.030858042734341062, 0.9475658421131599, 0.4083398045761335, 0.11227588564167855, -0.15941656336149485, 0.8806248445561231)

I don't understand what these 2 additional variables are and how they enter into the definition of the distribution.

If I generate a new pdf from the fit results, I can reproduce the original distribution well, but only when I use all 7 values:

xvals = np.linspace(-1,3, 1000)
gen_data_pdf = (lognorm(xvals, 0.1, 0.2, 0.5) + norm(xvals, 0.2, 0.1))
ydata1 = combo_dist.pdf(xvals,*Fit_results)
ydata2 = combo_dist.pdf(xvals,*Fit_results[:5])

plt.figure()
plt.plot(xvals, gen_data_pdf, label = 'Original distribution')
plt.plot(xvals, ydata1, label = 'Fitted distribution, all parameters')
plt.plot(xvals, ydata2, label = 'Fitted distribution, only first 5 parameters')

plt.legend()

p.s.1
The official documentation is a bit obscure to me and doesn't seem to provide any useful examples. Here on SO there are a few answers that provide some explanation (like here and here), but none of them seems to address my issue.

p.s.2
I am aware that the pdf of the combined distribution is not normalized to 1. In my original implementation I divided the pdf by 2, but for some reason the fit did not work with that additional division (RuntimeError, no convergence).
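On the normalization and speed points, one option is to bypass rv_continuous.fit and minimize the negative log-likelihood of a normalized mixture directly with scipy.optimize.minimize. The following is a minimal sketch under stated assumptions, not a drop-in replacement: the mixing weight w, the helper name neg_log_likelihood, the fixed seed, the smaller sample size and the Nelder-Mead method are all choices made for illustration and do not come from the original post. Whether this is actually faster, or converges better, on a given data set would have to be checked by timing it.

```python
import numpy as np
import scipy.stats as stats
from scipy.optimize import minimize

# Artificial data as in the question, but smaller and seeded (assumption).
rng = np.random.default_rng(0)
data = np.concatenate([
    stats.norm.rvs(loc=0.2, scale=0.1, size=1000, random_state=rng),
    stats.lognorm.rvs(s=0.1, loc=0.2, scale=0.5, size=1000, random_state=rng),
])

def neg_log_likelihood(params, x):
    """Negative log-likelihood of w * lognorm + (1 - w) * norm.

    Each component integrates to 1, so the weighted sum does too.
    The weight w is a hypothetical extra parameter, not part of the question.
    """
    w, s1, loc1, scale1, loc2, scale2 = params
    # Reject parameter values outside the valid region.
    if not (0 < w < 1) or s1 <= 0 or scale1 <= 0 or scale2 <= 0:
        return np.inf
    pdf = (w * stats.lognorm.pdf(x, s1, loc=loc1, scale=scale1)
           + (1 - w) * stats.norm.pdf(x, loc=loc2, scale=scale2))
    # log(0) yields -inf, which makes the objective +inf for that trial point.
    with np.errstate(divide='ignore'):
        return -np.sum(np.log(pdf))

# Starting values: equal weights plus the initial guesses used in the question.
x0 = np.array([0.5, 0.15, 0.15, 0.6, 0.25, 0.05])
result = minimize(neg_log_likelihood, x0, args=(data,), method='Nelder-Mead')

print(result.x)    # fitted w, s1, loc1, scale1, loc2, scale2
print(result.fun)  # negative log-likelihood at the returned parameters
```

Because the likelihood is written out explicitly, there are no implicit loc and scale parameters here: the optimizer sees exactly the six values listed in params.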

七颜 2025-01-23 13:22:43

The two extra values are the loc and scale parameters, which, according to the documentation, shift and scale the distribution. To keep them out of the fit, fix their values:

Fit_results = combo_dist.fit(gen_data, 0.15, 0.15, 0.6, 0.25, 0.05,
                             floc=0, fscale=1)
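To see how the two extra values enter, the sketch below rebuilds the question's combo_dist and checks the relation that rv_continuous applies for loc and scale: pdf(x, *shapes, loc, scale) equals _pdf((x - loc) / scale, *shapes) / scale. The numeric values for shapes, loc and scale are illustrative, loosely based on the numbers in the question, and the private _pdf method is called directly only for demonstration.

```python
import numpy as np
import scipy.stats as stats

class combo_dist_gen(stats.rv_continuous):
    """Unnormalized sum of a lognormal and a normal pdf, as in the question."""
    def _pdf(self, x, s1, loc1, scale1, loc2, scale2):
        return (stats.lognorm.pdf(x, s1, loc=loc1, scale=scale1)
                + stats.norm.pdf(x, loc=loc2, scale=scale2))

combo_dist = combo_dist_gen(name='combo_dist')

# Illustrative values (assumptions): five shape parameters, then loc and scale.
shapes = (0.1, 0.2, 0.5, 0.2, 0.1)
loc, scale = -0.16, 0.88
xvals = np.linspace(-1, 3, 9)

# The public pdf takes the shape parameters first, then loc and scale.
via_pdf = combo_dist.pdf(xvals, *shapes, loc, scale)

# The same values computed by hand from the user-defined _pdf.
by_hand = combo_dist._pdf((xvals - loc) / scale, *shapes) / scale
print(np.allclose(via_pdf, by_hand))

# Omitting loc and scale is the same as passing loc=0, scale=1.
default = combo_dist.pdf(xvals, *shapes)
explicit = combo_dist.pdf(xvals, *shapes, 0, 1)
print(np.allclose(default, explicit))
```

This is why combo_dist.pdf(xvals, *Fit_results[:5]) in the question does not reproduce the data: it silently uses loc=0 and scale=1 instead of the fitted values. Note that with floc=0 and fscale=1 the tuple returned by fit still has seven entries; the last two are simply the fixed values 0 and 1.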

About the author: 半夏半凉