Using scipy.stats.gaussian_kde with 2-dimensional data

Posted on 2024-10-01 02:05:42


I'm trying to use the scipy.stats.gaussian_kde class to smooth out some discrete data collected with latitude and longitude information, so that it ends up looking somewhat like a contour map, where the high densities are the peaks and the low densities are the valleys.

I'm having a hard time putting a two-dimensional dataset into the gaussian_kde class. I've played around to figure out how it works with 1-dimensional data, so I thought 2-dimensional data would be something along the lines of:

from scipy import stats
from numpy import array
data = array([[1.1, 1.1],
              [1.2, 1.2],
              [1.3, 1.3]])
kde = stats.gaussian_kde(data)
kde.evaluate([1,2,3],[1,2,3])

which is meant to say that I have 3 points at [1.1, 1.1], [1.2, 1.2] and [1.3, 1.3], and that I want the kernel density estimate over the range 1 to 3, using a width of 1 on the x and y axes.

When creating the gaussian_kde, it keeps giving me this error:

raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix

Looking into the source code of gaussian_kde, I realize that the way I'm thinking about what the dataset means is completely different from how the dimensionality is calculated, but I could not find any sample code showing how multi-dimensional data works with the module. Could someone help me with some sample ways to use gaussian_kde with multi-dimensional data?


回忆追雨的时光 2024-10-08 02:05:42


This example seems to be what you're looking for:

import numpy as np
import scipy.stats as stats
from matplotlib.pyplot import imshow

# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(2000,1)),
                stats.norm.rvs(loc=0,scale=3,size=(2000,1)),
                axis=1)

kde = stats.gaussian_kde(rvs.T)  # expects shape (n_dims, n_points)

# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)

z = kde(grid_coords.T)
z = z.reshape(128,128)

imshow(z,aspect=np.ptp(x_flat)/np.ptp(y_flat))


Axes need fixing, obviously.
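One possible way to fix the axes, offered here as a sketch and not as part of the original answer, is to pass extent and origin="lower" to imshow so that the tick labels show data coordinates instead of pixel indices. The smaller sample, the 64-point grid and the variable names are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Illustrative dummy data: 500 points, shape (n_points, 2)
rvs = np.column_stack((rng.normal(2, 1, 500), rng.normal(0, 3, 500)))

kde = stats.gaussian_kde(rvs.T)  # gaussian_kde wants shape (2, n_points)

n = 64
x_flat = np.linspace(rvs[:, 0].min(), rvs[:, 0].max(), n)
y_flat = np.linspace(rvs[:, 1].min(), rvs[:, 1].max(), n)
x, y = np.meshgrid(x_flat, y_flat)  # rows follow y, columns follow x
z = kde(np.vstack((x.ravel(), y.ravel()))).reshape(n, n)

# extent maps the image edges to data coordinates; origin="lower" puts
# the smallest y at the bottom, as in an ordinary plot
extent = (x_flat[0], x_flat[-1], y_flat[0], y_flat[-1])
fig, ax = plt.subplots()
image = ax.imshow(z, origin="lower", extent=extent, aspect="auto")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Call plt.show() or fig.savefig(...) afterwards to see the result. A scatter plot of rvs drawn on the same axes should then line up with the density image.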

You can also do a scatter plot of the data with

from matplotlib.pyplot import scatter
scatter(rvs[:,0],rvs[:,1])


伊面 2024-10-08 02:05:42


I think you are mixing up kernel density estimation with interpolation or maybe kernel regression. KDE estimates the distribution of points if you have a larger sample of points.

I'm not sure which interpolation you want, but either the splines or rbf in scipy.interpolate will be more appropriate.
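If interpolation rather than density estimation is what is needed, a minimal sketch with scipy.interpolate.RBFInterpolator (available in SciPy 1.7 and later) could look like the following. The sample coordinates and the values array are made up for illustration; the question does not say that a measured value is attached to each point.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical scattered samples: (x, y) coordinates with one value each
points = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0], [2.0, 2.0]])
values = np.array([0.0, 1.0, 1.0, 2.0, 5.0])

# Default kernel is a thin-plate spline; with no smoothing the surface
# passes through the sample values
interp = RBFInterpolator(points, values)

# Evaluate on a regular 3x3 grid covering 1..3 in both directions
gx, gy = np.mgrid[1:3:3j, 1:3:3j]
grid_points = np.column_stack((gx.ravel(), gy.ravel()))  # shape (9, 2)
surface = interp(grid_points).reshape(gx.shape)
```

Note that RBFInterpolator takes one point per row, which is the opposite of the layout gaussian_kde expects.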

If you want one-dimensional kernel regression, then you can find a version in statsmodels (formerly scikits.statsmodels) with several different kernels.

update: here is an example (if this is what you want)

>>> import numpy as np
>>> from scipy import stats
>>> data = 2 + 2*np.random.randn(2, 100)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 0.02573917,  0.02470436,  0.03084282])

gaussian_kde expects variables in rows and observations in columns, i.e. an array of shape (n_dims, n_points), which is the reverse of the usual orientation in statistics. In your example, all three points are on a line, so the two coordinates are perfectly correlated. That is, I guess, the reason for the singular matrix.

Adjusting the array orientation and adding a small noise, the example works, but still looks very concentrated, for example you don't have any sample point near (3,3):

>>> data = np.array([[1.1, 1.1],
...                  [1.2, 1.2],
...                  [1.3, 1.3]]).T
>>> data = data + 0.01*np.random.randn(2,3)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([  7.70204299e+000,   1.96813149e-044,   1.45796523e-251])
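The orientation and the degeneracy described above can be checked with NumPy alone. This is a small sketch using the three points from the question; the variable names are illustrative.

```python
import numpy as np

# The three points from the question, one point per row: shape (3, 2)
points = np.array([[1.1, 1.1],
                   [1.2, 1.2],
                   [1.3, 1.3]])

# gaussian_kde expects shape (n_dims, n_points), so transpose to (2, 3)
data = points.T

cov = np.cov(data)                 # 2x2 covariance matrix of x and y
rank = np.linalg.matrix_rank(cov)  # a rank below 2 means cov is singular
corr = np.corrcoef(data)[0, 1]     # correlation between x and y

print(data.shape, rank, corr)
```

Without the transpose, the (3, 2) array is read as three variables with only two observations, which also gives a rank-deficient covariance matrix.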


維他命╮ 2024-10-08 02:05:42


I found it difficult to understand the SciPy manual's description of how gaussian_kde works with 2-D data. Here is an explanation which is intended to complement @endolith's example. I divided the code into several steps with comments to explain the less intuitive bits.

First, the imports:

import numpy as np
import scipy.stats as st
from matplotlib.pyplot import imshow, show

Create some dummy data: these are 1-D arrays of the "X" and "Y" point coordinates.

np.random.seed(142)  # for reproducibility
x = st.norm.rvs(loc=2, scale=1, size=2000)
y = st.norm.rvs(loc=0, scale=3, size=2000)

For 2-D density estimation the gaussian_kde object has to be initialised with an array with two rows containing the "X" and "Y" datasets. In NumPy terminology, we "stack them vertically":

xy = np.vstack((x, y))

so the "X" data are in the first row xy[0,:], the "Y" data are in the second row xy[1,:], and xy.shape is (2, 2000). Now create the gaussian_kde object:

dens = st.gaussian_kde(xy)
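To make the expected layout concrete, here is a short sketch with a smaller sample (200 points instead of 2000, an arbitrary choice) that inspects the d and n attributes of the gaussian_kde object and evaluates the density at one point and at several points.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(142)
x = rng.normal(loc=2, scale=1, size=200)
y = rng.normal(loc=0, scale=3, size=200)

xy = np.vstack((x, y))            # shape (2, 200): one row per variable
same = np.column_stack((x, y)).T  # an (n_points, 2) array, transposed

dens = st.gaussian_kde(xy)
print(dens.d, dens.n)             # number of dimensions, number of points

single = dens([2.0, 0.0])         # density at the single point (2, 0)
several = dens(np.array([[1.0, 2.0, 3.0],
                         [0.0, 0.0, 0.0]]))  # three points, one per column
```

Even for a single point the result is a one-element array, not a scalar.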

We will evaluate the estimated 2-D density PDF on a 2-D grid. There is more than one way of creating such a grid in NumPy. I show here an approach which is different from (but functionally equivalent to) @endolith's method:

gx, gy = np.mgrid[x.min():x.max():128j, y.min():y.max():128j]
gxy = np.dstack((gx, gy)) # shape is (128, 128, 2)

gxy is a 3-D array: its [i,j]-th element contains a 2-element array of the corresponding "X" and "Y" values, i.e. gxy[i, j] is [gx[i, j], gy[i, j]].
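The layout of gxy is easier to see on a tiny grid. The sketch below uses an arbitrary 3x2 grid in place of the 128x128 one.

```python
import numpy as np

# A tiny 3x2 grid so the layout is easy to inspect
gx, gy = np.mgrid[0:2:3j, 10:11:2j]  # gx varies along axis 0, gy along axis 1
gxy = np.dstack((gx, gy))            # shape (3, 2, 2)

i, j = 2, 1
print(gxy[i, j])                     # the (x, y) pair at grid position (i, j)
print(gx[i, j], gy[i, j])            # the same two numbers

# Flattened to the (2, n_points) layout that gaussian_kde takes directly
flat = gxy.reshape(-1, 2).T          # shape (2, 6)
```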

We have to invoke dens() (or dens.pdf() which is the same thing) on each of the 2-D grid points. NumPy has a very elegant function for this purpose:

z = np.apply_along_axis(dens, 2, gxy)

In words, the callable dens (could have been dens.pdf as well) is invoked along axis=2 (the third axis) of the 3-D array gxy and the values should be returned as a 2-D array. The only glitch is that the shape of z will be (128,128,1) and not the (128,128) that I expected. Note that the documentation says:

The shape of out [the return value, L.D.] is identical to the shape of arr, except along the
axis dimension. This axis is removed, and replaced with new dimensions
equal to the shape of the return value of func1d. So if func1d returns
a scalar out will have one fewer dimensions than arr.

The reason is that dens() returns a one-element array of shape (1,) for a single point, and not the scalar I was hoping for. I didn't investigate the issue any further, because this is easy to fix:

z = z.reshape(128, 128)
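As an alternative to np.apply_along_axis, which calls dens once per grid point, gaussian_kde also accepts all grid points in a single call as a (2, n_points) array. The sketch below compares the two on a small 16x16 grid; the sample size and grid size are arbitrary choices to keep it quick.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(142)
xy = np.vstack((rng.normal(2, 1, 300), rng.normal(0, 3, 300)))
dens = st.gaussian_kde(xy)

n = 16  # small grid so the point-by-point version finishes quickly
gx, gy = np.mgrid[xy[0].min():xy[0].max():n * 1j,
                  xy[1].min():xy[1].max():n * 1j]

# One call on a (2, n*n) array of grid points, reshaped back to the grid
z_fast = dens(np.vstack((gx.ravel(), gy.ravel()))).reshape(gx.shape)

# Point-by-point evaluation as in the text, for comparison
z_slow = np.apply_along_axis(dens, 2, np.dstack((gx, gy))).reshape(n, n)

print(np.allclose(z_fast, z_slow))
```

The single call avoids the extra trailing axis, so only one reshape back to the grid shape is needed.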

after which we can generate the image:

imshow(z, aspect=np.ptp(gx) / np.ptp(gy))
show()  # needed if you try this in PyCharm

Here is the image: https://i.sstatic.net/ZGI4G.png (output of the above commands). Note that I have implemented @endolith's version as well and got an image indistinguishable from this one.


清秋悲枫 2024-10-08 02:05:42


The example posted in the top answer didn't work for me. I had to tweak it a little bit, and it works now:

import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt

# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(2000,1)),
                stats.norm.rvs(loc=0,scale=3,size=(2000,1)),
                axis=1)

kde = stats.gaussian_kde(rvs.T)  # expects shape (n_dims, n_points)

# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)

z = kde(grid_coords.T)
z = z.reshape(128,128)

plt.imshow(z,aspect=np.ptp(x_flat)/np.ptp(y_flat))
plt.show()
