Plotting the probability distribution of data using sklearn's KDE function
I have a number of samples of a variable, and I would like to use them to plot the variable's probability distribution. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]
# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(X.min(), X.max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
# -- plot density
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()
Below is the resulting output:
As you can see, the value on the y axis is above one. Hence, the y axis is NOT showing the probability distribution. I further plotted the histogram for this data:
# -- plot hist
n_bins = 40
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()
and the result is below:
which makes sense as the bins sum up to one: 0.025*40=1
I'm having a hard time understanding why my kde plot is not a distribution. How can I fix this? Is there a normalization step that I'm missing?
Comments (1)
First, if you extend the limits of your X_plot axis (i.e. X_plot = np.linspace(-1, 1,...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1. Density values over 1 are perfectly legal, since the assumed distribution is continuous: exact points have no real probabilities, so you should not treat your Y values as such; the estimated probability for an interval is the corresponding area under the curve.
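To see this numerically, here is a minimal sketch that evaluates the estimate on the wider (-1, 1) grid and integrates it with the trapezoidal rule. The sample size of 5000 and the fixed seed are assumptions made only to keep the run short; the question uses 133280 samples.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Assumption: a smaller, seeded sample than in the question, for speed.
rng = np.random.default_rng(0)
init_range = 0.0793
X = rng.uniform(-init_range, init_range, size=5000)[:, np.newaxis]

# Same estimator settings as in the question.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)

# Evaluate the density on a grid wide enough to include the tails.
grid = np.linspace(-1.0, 1.0, 2001)
dens = np.exp(kde.score_samples(grid[:, np.newaxis]))

# Trapezoidal rule: sum of average heights times the grid spacing.
dx = grid[1] - grid[0]
area = float(np.sum((dens[:-1] + dens[1:]) / 2.0) * dx)
peak = float(dens.max())

print(f"peak density: {peak:.3f}")
print(f"area under curve on [-1, 1]: {area:.4f}")
```

The peak can exceed 1 while the area stays close to 1, because the curve is narrow: height times width is what has to integrate to one.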
Sample code to verify the estimated probability of hitting the 0 to 0.004 range (roughly the same width as one of your histogram bins):
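A minimal sketch of such a check (not the answerer's original snippet): integrate the estimated density over the interval and compare it with the fraction of samples that actually fall there. The sample size, the seed, and the names `prob_kde` and `prob_empirical` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Assumption: a smaller, seeded sample than in the question, for speed.
rng = np.random.default_rng(0)
init_range = 0.0793
X = rng.uniform(-init_range, init_range, size=5000)[:, np.newaxis]

kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)

# Interval of interest: about one histogram bin wide (2 * 0.0793 / 40).
a, b = 0.0, 0.004

# Probability under the KDE = area under the density between a and b,
# approximated with the trapezoidal rule on a fine grid.
grid = np.linspace(a, b, 201)
dens = np.exp(kde.score_samples(grid[:, np.newaxis]))
dx = grid[1] - grid[0]
prob_kde = float(np.sum((dens[:-1] + dens[1:]) / 2.0) * dx)

# Empirical probability = share of samples landing in [a, b).
prob_empirical = float(np.mean((X[:, 0] >= a) & (X[:, 0] < b)))

print(f"KDE estimate: {prob_kde:.4f}")
print(f"empirical:    {prob_empirical:.4f}")
```

With bandwidth=0.2 the KDE figure should come out well below the empirical fraction, which is what the histogram's per-bin value of about 0.025 reflects.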
Second, once you check the area under the curve, you'll see that your current hyperparameters don't give a very accurate estimate; reducing the bandwidth or choosing a different algorithm might help.
You can also apply grid search to find the least inaccurate algorithm and bandwidth, though this will take a good amount of time unless you reduce your sample size; also, choosing too narrow a bandwidth may result in undersmoothing.
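The grid search could look roughly like the sketch below, which uses `GridSearchCV` to pick a bandwidth by cross-validated log-likelihood. The subsample size of 2000, the seed, the candidate bandwidths, and the 5-fold split are all assumptions chosen to keep the search quick; only the Gaussian kernel is searched here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Assumption: a reduced, seeded sample so the search finishes quickly.
rng = np.random.default_rng(0)
init_range = 0.0793
X = rng.uniform(-init_range, init_range, size=2000)[:, np.newaxis]

# Candidate bandwidths from 0.001 to 1 on a log scale (hypothetical grid).
bandwidths = np.logspace(-3, 0, 13)

# With no explicit scorer, GridSearchCV uses KernelDensity.score, the
# total log-likelihood of the held-out fold, so higher is better.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": bandwidths},
    cv=5,
)
search.fit(X)

best_bw = float(search.best_params_["bandwidth"])
print(f"best bandwidth: {best_bw:.4f}")

# Refit-ready estimator trained on all of X with the chosen bandwidth.
best_kde = search.best_estimator_
```

Log-likelihood is only one possible selection criterion; the chosen bandwidth is still worth checking visually against the histogram, since a value that is too small gives a jagged, undersmoothed curve.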