Principal Component Analysis - two approaches, yielding different results

Posted 2025-01-10 07:56:47


I'm trying to conduct Principal Component Analysis manually.

However, I've run into an issue:

Problem: when conducting PCA "manually" from scratch, using two different approaches, I don't get the same results.

Let me showcase the issue by means of an example:

Consider the following numpy array (1):

import numpy as np

x = np.array([
    [1.5, 2.3, 5.2, 3.2, 5.5],
    [3.5, 4.2, 6.5, 8.9, 7.5],
    [9.6, 8.2, 7.1, 9.3, 1.1],
    [3.1, 2.7, 2.9, 3.5, 9.6],
    [1.1, 6.7, 2.3, 3.5, 9.5]])

In the first approach, I first calculate the covariance matrix (2):

CovM = np.cov(x.T)
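
For reference, np.cov(x.T) treats each column of x as a variable and each row as an observation, and matches the textbook sample-covariance formula; a minimal check, assuming the x and CovM defined above:

n = x.shape[0]
xc = x - x.mean(axis=0)            # center each variable (column)
manual_cov = xc.T @ xc / (n - 1)   # sample covariance with ddof=1, as np.cov uses by default
assert np.allclose(manual_cov, CovM)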

From the covariance matrix, I calculate the eigenvectors and eigenvalues (3):

eig_vals, eig_vecs = np.linalg.eig(CovM)  # general eigensolver; column i of eig_vecs pairs with eig_vals[i]

I then sort my eigenvalues from highest to lowest (needed for the PC calculation) (4):

# Pair each eigenvalue with its eigenvector (column i of eig_vecs);
# np.abs is harmless here since a covariance matrix has non-negative eigenvalues
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)  # descending by eigenvalue

I then store my eigenvectors (5):

matrix_w = np.hstack((eig_pairs[0][1].reshape(5, 1),
                      eig_pairs[1][1].reshape(5, 1),
                      eig_pairs[2][1].reshape(5, 1),
                      eig_pairs[3][1].reshape(5, 1)))

Calculating my principal components I get (6):

x.dot(matrix_w)  # project the data onto the four retained eigenvectors

array([[ 1.57663265, -2.09793105,  6.2812378 ,  0.02538662],
       [ 5.18837375, -4.40212324, 11.50839994,  0.36124448],
       [13.58239293, -5.93222094,  6.90492473,  1.19626484],
       [-0.23287223, -5.19146231,  8.05750048,  3.01045736],
       [-0.33881008, -8.33915097,  7.44838336, -0.16441424]])

Now, note these results: they're going to be different from the second approach, and that's exactly the issue! While the values have the same magnitudes, they differ in sign (positive vs. negative).

Now, for the second approach:

Consider the same numpy array from before (1). I will reuse that part, but calculate everything else from scratch, hence:

The covariance matrix for the numpy array x can be found as (7):

cov_mat = np.cov(x.T)

Similarly, we can calculate the eigenvalues and eigenvectors as before (8):

eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

Now, I can sort my eigenvalues from highest to lowest, using (9):

sorted_index = np.argsort(eigen_values)[::-1]          # indices, descending by eigenvalue
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:, sorted_index]   # reorder the columns to match

I choose to retain 4 components (the first 4 PCs), hence (10):

n_components = 4
eigenvector_subset = sorted_eigenvectors[:,0:n_components]

Similarly, as before, I can calculate the principal components (11):

np.dot(eigenvector_subset.transpose(), x.transpose()).transpose()  # algebraically the same as x.dot(eigenvector_subset)

array([[ -1.57663265,   2.09793105,   6.2812378 ,   0.02538662],
       [ -5.18837375,   4.40212324,  11.50839994,   0.36124448],
       [-13.58239293,   5.93222094,   6.90492473,   1.19626484],
       [  0.23287223,   5.19146231,   8.05750048,   3.01045736],
       [  0.33881008,   8.33915097,   7.44838336,  -0.16441424]])

The above shows the very issue. When comparing the results of (11) with (6), I see that I get different signs (positive and negative) for some reason that I can't explain. While trying to resolve the issue, one thing I noticed is that if I use the definitions from (3), i.e. eig_vals and eig_vecs, instead of (8) when sorting in (9), I do indeed get the same results. But this seems a bit suspicious, given that the way I calculate the covariance matrix and the eigenvalues and eigenvectors is essentially the same in both approaches - or am I missing something?
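
In fact, a quick check (a sketch, reusing eig_vals/eig_vecs from (3) and eigen_values/sorted_eigenvectors from (8)-(9); it assumes all eigenvalues are distinct, as is the case here) shows that the two routines return the same eigenvectors only up to sign:

# Sort the eig() output the same way the eigh() output was sorted in (9)
order = np.argsort(eig_vals)[::-1]
vecs_eig = eig_vecs[:, order]

print(np.allclose(np.sort(eig_vals), np.sort(eigen_values)))       # True: identical spectra
print(np.allclose(np.abs(vecs_eig), np.abs(sorted_eigenvectors)))  # True: identical magnitudes
print(np.allclose(vecs_eig, sorted_eigenvectors))                  # False: some columns are sign-flipped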

Please let me know if my question is not clear and concise.


Comments (1)

池予 2025-01-17 07:56:47


Solution: I found the issue; it relates to "eig" vs. "eigh" when calculating the eigenvectors and eigenvalues. I was using one routine in the first approach and the other in the second, hence the different results. Both outputs are valid: an eigenvector is only defined up to sign (if Av = λv, then A(-v) = λ(-v)), so the two routines are free to return sign-flipped vectors, which flips the sign of the corresponding principal components.
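
If a deterministic output is wanted, one way to make the two approaches agree (a sketch, not part of the original answer; fix_signs is a hypothetical helper) is to fix a sign convention before projecting, e.g. flip each eigenvector so that its largest-magnitude entry is positive, similar in spirit to scikit-learn's svd_flip:

def fix_signs(vectors):
    """Flip each column so that its largest-magnitude entry is positive."""
    idx = np.abs(vectors).argmax(axis=0)                    # row of the dominant entry per column
    signs = np.sign(vectors[idx, np.arange(vectors.shape[1])])
    return vectors * signs                                  # broadcasts over rows, flipping negative columns

# For this data (all eigenvalues distinct), the projections then agree:
# x.dot(fix_signs(matrix_w)) and x.dot(fix_signs(eigenvector_subset))
# produce identical principal components up to floating-point noise.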
