Using sklearn's KDE to plot the probability distribution of data

Posted 2025-02-06


I have a number of samples of a variable. I would like to use these samples to plot the probability distribution of the variable. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]

# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(X.min(), X.max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)

# -- plot density
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()

Below is the resulting output:

[figure: KDE density curve, peaking above 1]

As you can see, the values on the y-axis go above one. Hence, the y-axis is NOT showing a probability distribution. I further plotted the histogram for this data:

# -- plot hist
n_bins = 40
# weight each sample by 1/N so that bin heights are probabilities rather than counts
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()

and the result is below:

[figure: step histogram, bin heights around 0.025]

which makes sense, as the bin heights sum to one: 0.025 * 40 = 1
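
As a quick sanity check (a minimal snippet, reusing the prob array returned by the plt.hist call above):

# with the 1/N weights, the bin heights are probabilities and sum to ~1
print(np.sum(prob))  # ~1.0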

I'm having a hard time understanding why my KDE plot is not a probability distribution. How can I fix this? Is there a normalization step that I'm missing?


Answer from 提笔书几行 (2025-02-13):


First, if you extend the limits of your X_plot axis (i.e. X_plot = np.linspace(-1, 1, ...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1.
Density values over 1 are perfectly legal, since the assumed distribution is continuous: there are no real probabilities for exact points, so you should not treat your y values as such; the estimated probability of an interval is the corresponding area under the curve.
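
To see this concretely, here is a minimal check (assuming the kde object fitted in the question) that the total area under the estimated density is about 1:

import numpy as np
import scipy.integrate as integrate

# evaluate the fitted KDE on a wide grid and integrate numerically
wide = np.linspace(-1, 1, 2000)[:, np.newaxis]
area = integrate.trapezoid(np.exp(kde.score_samples(wide)), wide[:, 0])
print(area)  # ~1.0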

Sample code to verify the estimated probability of hitting the 0-0.004 range (roughly the same width as your histogram bins):

import scipy.integrate as integrate
interval = np.linspace(0, 0.004, 1000)[:, np.newaxis]
log_dens = kde.score_samples(interval)
# integrate.trapezoid replaces integrate.trapz, which was removed in recent SciPy
print(integrate.trapezoid(np.exp(log_dens), interval[:, 0]))

Second, once you check the area under the curve, you'll see your current hyperparameters aren't yielding a very accurate estimate: with bandwidth=0.2 the integral above comes out around 0.008, far below the ~0.025 the histogram assigns to a bin of that width. Reducing the bandwidth or choosing a different kernel might help.
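
For instance, a rule-of-thumb bandwidth (my addition, not part of the original answer) gives a data-driven starting point:

# Silverman's rule of thumb for a Gaussian kernel: a common heuristic
bw = 1.06 * X.std() * len(X) ** (-1 / 5)
kde = KernelDensity(kernel="gaussian", bandwidth=bw).fit(X)
print(f"rule-of-thumb bandwidth: {bw:.4f}")

For this sample, that works out to roughly 0.005, about 40 times smaller than 0.2.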

You can also apply a grid search to find the least inaccurate kernel and bandwidth, though this will take a good amount of time unless you reduce your sample size; also, choosing a narrow bandwidth may result in undersmoothing.

from sklearn.model_selection import GridSearchCV

# search over kernels and log-spaced bandwidths, scored by cross-validated log-likelihood
params = {
    "kernel": ["gaussian", "tophat"],
    "bandwidth": np.logspace(-2, 0, 10),
}
grid = GridSearchCV(KernelDensity(), params, cv=5, n_jobs=-1)
grid.fit(X)
print(f"best hyperparameters: {grid.best_params_}")
kde = grid.best_estimator_
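
Once tuned, you might re-plot the density against the histogram (a sketch, not part of the original answer); note that even a well-fitted density for this data will peak well above 1, which is fine for a continuous variable:

# re-evaluate the tuned estimator on the original plotting grid
log_dens = kde.score_samples(X_plot)
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2)
plt.show()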