Using sklearn's KDE to plot the probability distribution of data

Posted 2025-02-06


I have a number of samples of a variable. I would like to use these samples to plot the probability distribution of the variable. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]

# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(X.min(), X.max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)

# -- plot density
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()

Below is the resulting output:

[figure: KDE density curve, peaking above 1]

As you can see, the values on the y-axis go above one. Hence, the y-axis is NOT showing a probability distribution. I further plotted the histogram for this data:

# -- plot hist
n_bins = 40
# weight each sample by 1/N so that bin heights are probabilities rather than counts
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()

and the result is below:

[figure: step histogram, bin heights around 0.025]

which makes sense, as the bin heights sum to one: 0.025 * 40 = 1
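
As a quick sanity check (a minimal snippet, reusing the prob array returned by the plt.hist call above):

# with the 1/N weights, the bin heights are probabilities and sum to ~1
print(np.sum(prob))  # ~1.0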

I'm having a hard time understanding why my KDE plot is not a probability distribution. How can I fix this? Is there a normalization step that I'm missing?


Answer from 提笔书几行 (2025-02-13):


First, if you extend the limits of your X_plot axis (i.e. X_plot = np.linspace(-1, 1, ...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1.
Density values over 1 are perfectly legal, since the assumed distribution is continuous: there are no real probabilities for exact points, so you should not treat your y values as such; the estimated probability of an interval is the corresponding area under the curve.
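
To see this concretely, here is a minimal check (assuming the kde object fitted in the question) that the total area under the estimated density is about 1:

import numpy as np
import scipy.integrate as integrate

# evaluate the fitted KDE on a wide grid and integrate numerically
wide = np.linspace(-1, 1, 2000)[:, np.newaxis]
area = integrate.trapezoid(np.exp(kde.score_samples(wide)), wide[:, 0])
print(area)  # ~1.0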

Sample code to verify the estimated probability of hitting the 0-0.004 range (roughly the same width as your histogram bins):

import scipy.integrate as integrate
interval = np.linspace(0, 0.004, 1000)[:, np.newaxis]
log_dens = kde.score_samples(interval)
# integrate.trapezoid replaces integrate.trapz, which was removed in recent SciPy
print(integrate.trapezoid(np.exp(log_dens), interval[:, 0]))

Second, once you check the area under the curve, you'll see your current hyperparameters aren't yielding a very accurate estimate: with bandwidth=0.2 the integral above comes out around 0.008, far below the ~0.025 the histogram assigns to a bin of that width. Reducing the bandwidth or choosing a different kernel might help.
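
For instance, a rule-of-thumb bandwidth (my addition, not part of the original answer) gives a data-driven starting point:

# Silverman's rule of thumb for a Gaussian kernel: a common heuristic
bw = 1.06 * X.std() * len(X) ** (-1 / 5)
kde = KernelDensity(kernel="gaussian", bandwidth=bw).fit(X)
print(f"rule-of-thumb bandwidth: {bw:.4f}")

For this sample, that works out to roughly 0.005, about 40 times smaller than 0.2.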

You can also apply a grid search to find the least inaccurate kernel and bandwidth, though this will take a good amount of time unless you reduce your sample size; also, choosing a narrow bandwidth may result in undersmoothing.

from sklearn.model_selection import GridSearchCV

# search over kernels and log-spaced bandwidths, scored by cross-validated log-likelihood
params = {
    "kernel": ["gaussian", "tophat"],
    "bandwidth": np.logspace(-2, 0, 10),
}
grid = GridSearchCV(KernelDensity(), params, cv=5, n_jobs=-1)
grid.fit(X)
print(f"best hyperparameters: {grid.best_params_}")
kde = grid.best_estimator_
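
Once tuned, you might re-plot the density against the histogram (a sketch, not part of the original answer); note that even a well-fitted density for this data will peak well above 1, which is fine for a continuous variable:

# re-evaluate the tuned estimator on the original plotting grid
log_dens = kde.score_samples(X_plot)
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2)
plt.show()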