What's wrong with my PCA?
My code:
from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - same as I got but transposed, so that's okay I guess.
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
Edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way the proportion of the eigenvalues isn't altered, but they are important to me, so I'd like to understand what's going on.
4 Answers
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not the data itself. The covariance matrix, created from an m x n data matrix, will be an m x m matrix; because you standardized your data first, it is in fact the correlation matrix, with ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef.
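For example, a minimal sketch of the corrcoef/eig route (the function name pca_eig and its argument handling are illustrative assumptions, not the answerer's original code):

import numpy as np

def pca_eig(data):
    # data: observations-by-variables array, e.g. 150 x 4 for the iris set
    corr = np.corrcoef(data, rowvar=False)      # 4 x 4 correlation matrix
    eig_vals, eig_vecs = np.linalg.eig(corr)    # 4 eigenvalues, eigenvectors in the columns
    order = np.argsort(eig_vals)[::-1]          # sort by decreasing variance explained
    return eig_vals[order], eig_vecs[:, order]

Because the trace of a correlation matrix equals the number of variables, these eigenvalues sum to 4, which is the scale of the question's desired output.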
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module--it is a little easier to work with than svd, and the return values are the eigenvectors and eigenvalues themselves, nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you'll always have a square matrix to decompose, regardless of the form your data is in. This is obvious because the matrix you are decomposing in PCA is a covariance matrix, which by definition is always square (i.e., its rows and columns both index the variables of the original data, and each cell is the covariance of a pair of those variables, as evidenced by the ones down the main diagonal in the standardized case--a given variable has perfect correlation with itself).
The left singular vectors returned by SVD(A) are the eigenvectors of A A^T.
The covariance matrix of a dataset A (with observations in columns) is 1/(N-1) * A A^T.
Now, when you do PCA using the SVD, you have to divide each entry in your A matrix by sqrt(N-1) (equivalently, divide the squared singular values by N-1) so you get the eigenvalues of the covariance matrix at the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
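A small sketch of the SVD route with that scaling (assuming data is the standardized observations-by-variables array from the question; the function name is illustrative):

import numpy as np

def pca_svd(data):
    # data: centered/standardized array with observations in rows (150 x 4 for iris)
    n = len(data)
    # dividing by sqrt(N - 1) makes the squared singular values equal to the
    # eigenvalues of the sample covariance matrix
    u, s, vt = np.linalg.svd(data / np.sqrt(n - 1), full_matrices=False)
    eig_vals = s ** 2      # variance explained by each principal component
    components = vt        # rows of vt are the principal directions
    return eig_vals, components

Note that the question's numbers line up with dividing by N = 150 rather than N - 1, because data.std() uses the population convention (ddof=0); either way the proportions between the eigenvalues are unchanged, as the edit observes.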
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
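Working the substitution through gives D = Q^T X X^T Q = Q^T (Q S V^T)(V S Q^T) Q = S^2, i.e. the eigenvalues are the squared singular values. Below is a quick numerical check of this and of the cov orientation fix; the random array is only a stand-in for the asker's standardized 150 x 4 iris data, which is X^T in the notation above:

import numpy as np

# Stand-in for the standardized iris array: 150 observations in rows, 4 variables.
rng = np.random.default_rng(0)
data = rng.standard_normal((150, 4))
data -= data.mean(axis=0)                     # center so cov() and X X^T agree

# Fix for the 150-eigenvalue problem: tell cov that variables are in columns.
cov_mat = np.cov(data, rowvar=False)          # 4 x 4, same as np.cov(data.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)   # four eigenvalues, four eigenvectors

# With X = data.T (4-by-150), the eigenvalues of X X^T are the squared singular
# values of X; dividing by N - 1 relates them to the covariance matrix above.
s = np.linalg.svd(data.T, compute_uv=False)
print(np.allclose(np.sort(eig_vals)[::-1], s ** 2 / (len(data) - 1)))  # True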
Question already addressed: Principal component analysis in Python