Numpy.eig and the percentage of variance in PCA

Posted 2024-10-15 05:22:35

Picking up from where we left off...

So I can use linalg.eig or linalg.svd to compute the PCA. Each one returns different Principal Components/Eigenvectors and Eigenvalues when they're fed the same data (I'm currently using the Iris dataset).

Looking here, or at any other tutorial that applies PCA to the Iris dataset, I find that the eigenvalues are [2.9108 0.9212 0.1474 0.0206]. The eig method gives me a different set of eigenvalues/vectors to work with, which I don't mind, except that the tutorial's eigenvalues sum to the number of dimensions (4) and so can be used to find how much each component contributes to the total variance.

With the eigenvalues returned by linalg.eig I can't do that. For example, the values returned are [9206.53059607 314.10307292 12.03601935 3.53031167]. The proportion of variance in this case would be [0.96542969 0.03293797 0.00126214 0.0003702]. This other page says: "The proportion of the variation explained by a component is just its eigenvalue divided by the sum of the eigenvalues."
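
For concreteness, the rule from that page is just a one-line normalization; a minimal sketch with the tutorial's eigenvalues quoted above:

import numpy as np

# Eigenvalues from the tutorial (correlation-matrix PCA of Iris); they sum to 4
w = np.array([2.9108, 0.9212, 0.1474, 0.0206])
print(w / w.sum())  # -> roughly [0.7277 0.2303 0.0369 0.0052], variance per component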

Since the variance explained by each dimension should be constant (I think), these proportions are wrong. So, if I use the values returned by svd(), which are the values used in all tutorials, I can get the correct percentage of variation from each dimension, but I'm wondering why the values returned by eig can't be used like that.

I assume the results returned are still a valid way to project the variables, so is there a way to transform them so that I can get the correct proportion of variance explained by each variable? In other words, can I use the eig method and still have the proportion of variance for each variable? Additionally, could this mapping be done only in the eigenvalues so that I can have both the real eigenvalues and the normalized ones?

Sorry for the long writeup btw. Here's a (::) for having gotten this far. Assuming you didn't just read this line.

Comments (4)

寄居者 2024-10-22 05:22:35

Taking Doug's answer to your previous question and implementing the following two functions, I get the output shown below:

import numpy as np

def pca_eig(orig_data):
    # Standardize: zero mean and unit variance per column
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    # Eigendecomposition of the correlation matrix (rowvar=0: columns are variables)
    C = np.corrcoef(data, rowvar=0)
    w, v = np.linalg.eig(C)
    print("Using numpy.linalg.eig")
    print(w)  # eigenvalues
    print(v)  # eigenvectors, one per column

def pca_svd(orig_data):
    # Standardize: zero mean and unit variance per column
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    # SVD of the same correlation matrix; the singular values s match the eigenvalues
    C = np.corrcoef(data, rowvar=0)
    u, s, v = np.linalg.svd(C)
    print("Using numpy.linalg.svd")
    print(u)
    print(s)
    print(v)

Output:

Using numpy.linalg.eig
[ 2.91081808  0.92122093  0.14735328  0.02060771]
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]

Using numpy.linalg.svd
[[-0.52237162 -0.37231836  0.72101681  0.26199559]
 [ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
 [-0.58125401 -0.02109478 -0.14089226 -0.80115427]
 [-0.56561105 -0.06541577 -0.6338014   0.52354627]]
[ 2.91081808  0.92122093  0.14735328  0.02060771]
[[-0.52237162  0.26335492 -0.58125401 -0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
 [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
 [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]

In both cases, I get the desired eigenvalues.
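
In case it helps, here's a minimal way to drive the two functions above (a sketch; it assumes scikit-learn is installed purely to fetch the Iris data, which is not part of the original answer — any 150x4 array would do):

from sklearn.datasets import load_iris

iris = load_iris().data  # shape (150, 4): the four Iris measurements
pca_eig(iris)
pca_svd(iris)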

(り薆情海 2024-10-22 05:22:35

Are you sure the data is the same in both cases and the dimensions are in the correct order (you're not sending in a rotated array, are you)? I bet you'll find they both give the same results if you use them right ;)
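
One quick way to check that guess (a hypothetical sketch, not part of the answer): corrcoef with rowvar=0 treats columns as variables, so a samples-by-features array yields a small square matrix, while a transposed ("rotated") array yields a huge one:

import numpy as np

data = np.random.rand(150, 4)  # 150 samples, 4 variables
print(np.corrcoef(data, rowvar=0).shape)    # (4, 4): what PCA expects
print(np.corrcoef(data.T, rowvar=0).shape)  # (150, 150): symptom of a rotated input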

夕色琉璃 2024-10-22 05:22:35

There are three ways I know of to do PCA: derived from an eigenvalue decomposition of the correlation matrix, of the covariance matrix, or of the unscaled and uncentered data. It sounds like the data you are passing to linalg.eig is unscaled. Anyway, that is just a guess. A better place for your question is stats.stackexchange.com. The folks on math.stackexchange.com don't use actual numbers. :)
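
A hypothetical sketch of that guess (made-up data, not the Iris set): the three variants produce very different eigenvalues, and only the correlation-based ones sum to the number of dimensions:

import numpy as np

np.random.seed(0)
X = np.random.rand(150, 4) * [1, 10, 100, 1000]  # toy data with wildly different column scales

w_corr, _ = np.linalg.eig(np.corrcoef(X, rowvar=0))  # correlation matrix: standardized
w_cov, _ = np.linalg.eig(np.cov(X, rowvar=0))        # covariance matrix: centered, unscaled
w_raw, _ = np.linalg.eig(X.T @ X)                    # scatter of raw, uncentered data

print(w_corr.sum())         # ~4.0, so w_corr / 4 gives the usual proportions
print(w_cov / w_cov.sum())  # dominated by the large-scale columns
print(w_raw / w_raw.sum())  # skewed further still, like the values in the question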

原野 2024-10-22 05:22:35

I'd suggest using SVD (singular value decomposition) for PCA, because
1) it directly gives you the values and matrices you need, and
2) it's robust.
See principal-component-analysis-in-python on SO for an example with (surprise) Iris data.
Running it gives

read iris.csv: (150, 4)
Center -= A.mean: [ 5.84  3.05  3.76  1.2 ]
Center /= A.std: [ 0.83  0.43  1.76  0.76]

SVD: A (150, 4) -> U (150, 4)  x  d diagonal  x  Vt (4, 4)
d^2: 437 138 22.1 3.09
% variance: [  72.77   95.8    99.48  100.  ]
PC 0 weights: [ 0.52 -0.26  0.58  0.57]
PC 1 weights: [-0.37 -0.93 -0.02 -0.07]

You can see that the diagonal values d from the SVD, squared, give the proportion of total variance contributed by PC 0, PC 1, ... (reported cumulatively in the "% variance" line above).
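
For completeness, a minimal sketch of that bookkeeping (assuming A is the centered and scaled 150x4 Iris array built as in the log above; this snippet does not rebuild it):

import numpy as np

U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A (150, 4) -> U (150, 4), d (4,), Vt (4, 4)
var = d**2 / (d**2).sum()      # per-component proportion of total variance
print(np.cumsum(var) * 100)    # cumulative %, matching the "% variance" line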

Does this help?
