Numpy.eig and the percentage of variance in PCA
Picking up from where we left off...
So I can use linalg.eig or linalg.svd to compute the PCA. Each one returns different principal components/eigenvectors and eigenvalues when fed the same data (I'm currently using the Iris dataset).
Looking here or at any other tutorial with PCA applied to the Iris dataset, I'll find that the eigenvalues are [2.9108 0.9212 0.1474 0.0206]. The eig method gives me a different set of eigenvalues/vectors to work with, which I don't mind, except that the tutorials' eigenvalues, once summed, equal the number of dimensions (4) and can be used to find how much each component contributes to the total variance.
Taking the eigenvalues returned by linalg.eig, I can't do that. For example, the values returned are [9206.53059607 314.10307292 12.03601935 3.53031167], and the proportion of variance in this case would be [0.96542969 0.03293797 0.00126214 0.0003702]. This other page says that "the proportion of the variation explained by a component is just its eigenvalue divided by the sum of the eigenvalues."
Since the variance explained by each dimension should be constant (I think), these proportions are wrong. So, if I use the values returned by svd(), which are the values used in all the tutorials, I can get the correct percentage of variation for each dimension, but I'm wondering why the values returned by eig can't be used like that.
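For concreteness, here is a minimal sketch of the two computations being compared (the sklearn Iris loader and the variable names are illustrative, not from the original post):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                      # 150 x 4 raw Iris measurements

# eig on the raw (uncentered, unscaled) Gram matrix: the eigenvalues
# come out on an arbitrary scale, like the ~9206, 314, ... quoted above,
# so dividing by their sum gives misleading "proportions".
w, _ = np.linalg.eig(X.T @ X)
print(np.sort(w)[::-1] / w.sum())

# The svd route used by the tutorials: standardize first; the squared
# singular values, divided by their sum, are the proportions of variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.linalg.svd(Z, compute_uv=False)
print(s**2 / (s**2).sum())
```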
I assume the results returned are still a valid way to project the variables, so is there a way to transform them so that I can get the correct proportion of variance explained by each variable? In other words, can I use the eig method and still have the proportion of variance for each variable? Additionally, could this mapping be done only on the eigenvalues, so that I can have both the real eigenvalues and the normalized ones?
Sorry for the long writeup btw. Here's a (::) for having gotten this far. Assuming you didn't just read this line.
Comments (4)
Taking Doug's answer to your previous question and implementing the following two functions, I get the desired eigenvalues in both cases.
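The original code and output are not preserved in this copy; a sketch of what the two functions could look like, following Doug's approach of working from standardized data (the function names and the use of sklearn's Iris loader are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

def pca_eig(X):
    """PCA via eigendecomposition of the correlation matrix."""
    w, v = np.linalg.eig(np.corrcoef(X, rowvar=False))
    order = np.argsort(w)[::-1]        # eig does not sort its output
    return w[order], v[:, order]

def pca_svd(X):
    """PCA via SVD of the standardized data matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    u, s, vt = np.linalg.svd(Z, full_matrices=False)
    return s**2 / len(X), vt.T         # eigenvalues are s^2 / n

X = load_iris().data
print(pca_eig(X)[0])   # both print roughly [2.92 0.91 0.15 0.02],
print(pca_svd(X)[0])   # which sums to 4, the number of dimensions
```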
Are you sure the data for both cases are the same and the dimensions are in the correct order (you're not sending in the rotated array, are you)? I bet you'll find they both give the same results if you use them right ;)
There are three ways I know of to do PCA: derived from an eigenvalue decomposition of the correlation matrix, of the covariance matrix, or of the unscaled and uncentered data. It sounds like the data you're passing to linalg.eig is unscaled. Anyway, that's just a guess. A better place for your question is stats.stackexchange.com. The folks on math.stackexchange.com don't use actual numbers. :)
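A quick way to tell which of the three you're actually computing is to compare the sums of the eigenvalues (a sketch, assuming the Iris data):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Correlation matrix: eigenvalues sum to the number of dimensions (4).
print(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)).sum())

# Covariance matrix: eigenvalues sum to the total variance of the data.
print(np.linalg.eigvalsh(np.cov(X, rowvar=False)).sum())

# Unscaled, uncentered data: an arbitrary scale -- this matches the
# [9206.5, 314.1, ...] values in the question.
print(np.linalg.eigvalsh(X.T @ X).sum())
```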
I'd suggest using SVD (singular value decomposition) for PCA, because
1) it directly gives you the values and matrices you need, and
2) it's robust.
See principal-component-analysis-in-python on SO for an example with (surprise) Iris data.
Running it, you'll see that the diagonal matrix d from the SVD, squared, gives the proportion of total variance for PC 0, PC 1, ...
Does this help?
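The linked answer's output isn't reproduced above; a minimal sketch of the d-squared computation it describes (centering the Iris data first; the linked code's details may differ):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)                   # center each column
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
print(d**2 / (d**2).sum())                # proportion of total variance per PC
```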