Matlab：如何在matlab中使用PCA查找数据集中的哪些变量可以被丢弃？

发布于 2024-12-07 11:38:24 字数 835 浏览 4 评论 0原文

我正在使用 PCA 来找出数据集中的哪些变量由于与其他变量高度相关而被冗余。我在之前使用 zscore 标准化的数据上使用 princomp matlab 函数：

[coeff, PC, eigenvalues] = princomp(zscore(x))

我知道特征值告诉我数据集的变化程度涵盖每个主成分，而系数告诉我第 i 个原始变量有多少位于第 j 个主成分中（其中 i - 行，j - 列）。

因此，我假设要找出原始数据集中哪些变量最重要，哪些变量最不重要，我应该将 coeff 矩阵乘以 特征值 - coeff 值代表每个组件有多少变量，特征值表明该组件的重要性。这是我的完整代码：

[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e

但这并没有真正显示任何内容 - 我在以下集合上尝试了它，其中变量 1 与变量 2 完全相关 (v2 = v1 + 2)：

     v1    v2    v3
     1     3     4
     2     4    -1
     4     6     9
     3     5    -2

但我的计算结果如下

v1 0.5525
v2 0.5525
v3 0.5264

：这并没有真正显示出任何东西。我预计变量 2 的结果表明它远不如 v1 或 v3 重要。我的哪个假设是错误的？

原文

I am using PCA to find out which variables in my dataset are redundand due to being highly correlated with other variables. I am using princomp matlab function on the data previously normalized using zscore:

[coeff, PC, eigenvalues] = princomp(zscore(x))

I know that eigenvalues tell me how much variation of the dataset covers every principal component, and that coeff tells me how much of i-th original variable is in the j-th principal component (where i - rows, j - columns).

So I assumed that to find out which variables out of the original dataset are the most important and which are the least I should multiply the coeff matrix by eigenvalues - coeff values represent how much of every variable each component has and eigenvalues tell how important this component is.
So this is my full code:

[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e

But this does not really show anything - I tried it on a following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):

     v1    v2    v3
     1     3     4
     2     4    -1
     4     6     9
     3     5    -2

but the results of my calculations were following:

v1 0.5525
v2 0.5525
v3 0.5264

and this does not really show anything. I would expect the result for variable 2 show that it is far less important than v1 or v3.
Which of my assuptions is wrong?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

薔薇婲 2024-12-14 11:38:24

编辑现在我已经完全修改了答案，因为我明白了哪些假设是错误的。

在解释什么在 OP 中不起作用之前，让我先确保我们使用相同的术语。在主成分分析中，目标是获得能够很好地分离观测值的坐标变换，并且这可以使得在较低维空间中描述数据（即不同的多维观测值）变得容易。当观察由多个测量组成时，它们就是多维的。如果线性独立观测值少于测量值，我们预计至少一个特征值为零，因为例如 3D 空间中的两个线性独立观测向量可以由 2D 平面描述。

如果我们有一个由四个观测值组成的数组

x = [    1     3     4
         2     4    -1
         4     6     9
         3     5    -2];

，每个观测值具有三个测量值，princomp(x) 将找到四个观测值所跨越的低维空间。由于存在两个相互依赖的测量，其中一个特征值将接近于零，因为测量空间只是 2D 而不是 3D，这可能是您想要找到的结果。事实上，如果您检查特征向量 (coeff)，您会发现前两个分量非常明显地共线，

coeff = princomp(x)
coeff =
      0.10124      0.69982      0.70711
      0.10124      0.69982     -0.70711
       0.9897     -0.14317   1.1102e-16

因为前两个分量实际上指向相反的方向，变换后的观测值的前两个分量的值本身是没有意义的：[1 1 25] 相当于 [1000 1000 25]。

现在，如果我们想找出任何测量值是否是线性相关的，并且我们是否真的想为此使用主成分，因为在现实生活中，测量值可能不是完全共线的，我们感兴趣的是为机器学习应用程序找到良好的描述符向量，将三个测量视为“观察值”并运行 princomp(x') 更有意义。由于因此只有三个“观测值”，但有四个“测量值”，因此第四个特征向量将为零。然而，由于有两个线性相关的观测值，我们只剩下两个非零特征值：

eigenvalues =
       24.263
       3.7368
            0
            0

要找出哪些测量值高度相关（如果使用特征向量，则实际上没有必要）转换测量值作为机器学习的输入），最好的方法是查看测量值之间的相关性：

corr(x)
  ans =
        1            1      0.35675
        1            1      0.35675
  0.35675      0.35675            1

毫不奇怪，每个测量值都与其自身完全相关，并且 v1 与v2。

编辑2

但特征值告诉我们新空间中哪些向量最重要（涵盖大部分变化），系数还告诉我们每个变量在每个分量中占多少。所以我假设我们可以使用这些数据来找出哪些原始变量具有最大的方差，因此是最重要的（并去掉那些代表少量的变量）

如果您的观察结果在一个测量变量中显示出很小的方差，那么这是有效的（例如，其中 x = [1 2 3;1 4 22;1 25 -25;1 11 100];，因此第一个变量对方差没有贡献）。然而，通过共线测量，两个向量都拥有等价信息，并且对方差的贡献相同。因此，特征向量（系数）可能彼此相似。

为了让 @agnieszka 的评论继续有意义，我在下面保留了我的答案的原始第 1-4 点。请注意，#3 是对特征向量除以特征值的响应，这对我来说没有多大意义。

向量应该是行，而不是列（每个向量是一个
观察）。
coeff 返回主体的基向量
分量，其顺序与原始输入关系不大
要查看主分量的重要性，可以使用eigenvalues/sum(eigenvalues)
如果您有两个共线向量，你不能说第一个重要而第二个不重要。你怎么知道事情不应该是相反的？如果您想测试共线性，您应该检查数组的秩，或者对标准化（即 norm 等于 1）向量调用 unique。

EDIT I have completely reworked the answer now that I understand which assumptions were wrong.

Before explaining what doesn't work in the OP, let me make sure we'll have the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data , i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up from multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.

If we have an array

x = [    1     3     4
         2     4    -1
         4     6     9
         3     5    -2];

that consists of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since there are two co-dependent measurements, one of the eigenvalues will be near zero, since the space of measurements is only 2D and not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are extremely obviously collinear

coeff = princomp(x)
coeff =
      0.10124      0.69982      0.70711
      0.10124      0.69982     -0.70711
       0.9897     -0.14317   1.1102e-16

Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].

Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this, because in real life, measurements my not be perfectly collinear and we are interested in finding good vectors of descriptors for a machine-learning application, it makes a lot more sense to consider the three measurements as "observations", and run princomp(x'). Since there are thus three "observations" only, but four "measurements", the fourth eigenvector will be zero. However, since there are two linearly dependent observations, we're left with only two non-zero eigenvalues:

eigenvalues =
       24.263
       3.7368
            0
            0

To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:

corr(x)
  ans =
        1            1      0.35675
        1            1      0.35675
  0.35675      0.35675            1

Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.

EDIT2

but the eigenvalues tell us which vectors in the new space are most important (cover the most of variation) and also coefficients tell us how much of each variable is in each component. so I assume we can use this data to find out which of the original variables hold the most of variance and thus are most important (and get rid of those that represent small amount)

This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3;1 4 22;1 25 -25;1 11 100];, and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information, and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.

In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.

the vectors should be in rows, not columns (each vector is an
observation).
coeff returns the basis vectors of the principal
components, and its order has little to do with the original input
To see the importance of the principal components, you use eigenvalues/sum(eigenvalues)
If you have two collinear vectors, you can't say that the first is important and the second isn't. How do you know that it shouldn't be the other way around? If you want to test for colinearity, you should check the rank of the array instead, or call unique on normalized (i.e. norm equal to 1) vectors.