Row-wise correlation of an array with a vector

Posted on 2025-01-09 15:43:33

I have an array X with dimensions m x n, and for every one of the m rows I want the correlation with a vector y of length n.

In MATLAB this would be possible with the corr function, corr(X,y). In Python, however, this does not seem to be possible with the np.corrcoef function:

import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
np.corrcoef(X,y).shape

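This results in shape (1001, 1001), because np.corrcoef treats y as one more row and returns the full pairwise correlation matrix of all 1001 rows. That approach fails when the number of rows in X is large. In my case, I get this error: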

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 5.93 TiB for an array with shape (902630, 902630) and data type float64

This happens because X.shape[0] is 902630.
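To make the size in the error message concrete, here is a quick back-of-the-envelope check. It only assumes 8 bytes per float64 entry and the row count reported above; the variable names are just for illustration.

```python
# Memory needed for the full square correlation matrix that
# np.corrcoef would try to allocate for ~900k rows.
n_rows = 902630              # matrix side length taken from the error message
bytes_per_float64 = 8        # size of one float64 entry
matrix_bytes = n_rows * n_rows * bytes_per_float64
matrix_tib = matrix_bytes / 2**40   # convert bytes to TiB
print(f"{matrix_tib:.2f} TiB")      # 5.93 TiB
```

The wanted result, one correlation per row, would need only about 7 MB as float64, so the problem is the full matrix, not the data itself.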

My question is: how can I compute only the row-wise correlations with the vector, so that the result has shape (1000,)?

Of course this could be done via a list comprehension:

np.array([np.corrcoef(X[i, :], y)[0,1] for i in range(X.shape[0])])

I am therefore currently using numba with a for loop running through the more than 900000 rows. But I think there could be a much more efficient matrix operation for this problem.

EDIT:
Pandas also provides a method for this problem, the corrwith function:

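import pandas as pd

X_df = pd.DataFrame(X)
y_s = pd.Series(y)
X_df.corrwith(y_s, axis=1)  # axis=1 correlates each row with y_s; the default axis=0 works column-wise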

The implementation supports different correlation types, but it does not seem to be implemented as a matrix operation and is therefore really slow. There is probably a more efficient implementation.

Comments (1)

一杆小烟枪 2025-01-16 15:43:33

This should compute the correlation coefficient of each row with a specified y in a vectorized manner.

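import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
n = len(y)
# Pearson r per row: (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))
num = n * np.sum(X * y[None, :], axis=-1) - np.sum(X, axis=-1) * np.sum(y)
den = np.sqrt((n * np.sum(X**2, axis=-1) - np.sum(X, axis=-1) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2))
r = num / den
print(r[0], np.corrcoef(X[0], y)[0, 1])
# The two printed values agree, e.g. 0.4243951 0.4243951 (the exact number depends on the random data)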
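
An equivalent way to write the same idea is to center the rows and the vector first and then take a single matrix-vector product. This is only a sketch, not a tuned implementation: the helper name rowwise_pearson is made up for illustration, and it assumes X fits in memory as a float array and that no row (nor y) is constant, since a constant row gives a zero denominator.

```python
import numpy as np

def rowwise_pearson(X, y):
    """Pearson correlation of every row of X with the vector y (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)   # center each row of X
    yc = y - y.mean()                        # center the vector
    num = Xc @ yc                            # one numerator per row, shape (m,)
    den = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(10)
r = rowwise_pearson(X, y)
print(r.shape)  # (1000,)
```

Centering before multiplying avoids subtracting two large, nearly equal sums, which can help numerically when the values have a large mean. The only extra memory is one centered copy of X, so for very large inputs the rows could be processed in blocks.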