Scree Plot for Kernel PCA
I am trying to do a scree plot for kernel PCA. My X has 78 features and 247K samples. I am new to kernel PCA; however, I have used scree plots for linear PCA many times. The code below produces the scree plot for linear PCA. I want to use the scree plot to decide the number of components I will need before actually fitting the model.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Principal Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Dataset Explained Variance')
plt.show()
I tried to replicate the same approach for kernel PCA, but the explained_variance_ratio_ attribute does not exist for KernelPCA, which is why I did it the following way.
from sklearn.decomposition import KernelPCA

# fit and transform on the first 1000 rows only
pca = KernelPCA(kernel='rbf', gamma=10, fit_inverse_transform=False).fit_transform(scaled_merged.iloc[0:1000, :])
explained_variance = np.var(pca, axis=0)
explained_variance_ratio = explained_variance / np.sum(explained_variance)
plt.figure()
plt.plot(np.cumsum(explained_variance_ratio))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Dataset Explained Variance')
plt.show()
The scree plot produced by the kernel PCA code has a problem: it shows that I need 150 components to express close to 90% of the variance. Is there something wrong with my code?
The reason is simple. The sum of the eigenvalues in kernel PCA (kPCA) corresponds to the total explained variance in the feature space, which depends on your choice of kernel function. With an RBF kernel, kPCA is equivalent to classical PCA in a feature space of infinite dimension.
The eigenvalues returned by kPCA are eigenvalues in that feature space, not in the input space.
That is why the scree plot differs from the one for classical PCA, which corresponds to a linear kernel function. So, if you use a linear kernel, the result should be the same as PCA in the input space.
The correct way to make a fair comparison is to run kPCA with a linear kernel and check that its variance ratios match the explained_variance_ratio_ of classical PCA.
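What follows is only a minimal sketch of such a check (scikit-learn assumed; X_sample is random data used purely as a stand-in for the first 1000 rows of the scaled dataset in the question):

import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X_sample = rng.standard_normal((1000, 78))  # stand-in for scaled_merged.iloc[0:1000, :]

# Classical PCA: explained variance ratios in the input space
pca_ratio = PCA().fit(X_sample).explained_variance_ratio_

# Kernel PCA with a *linear* kernel: the per-component variance of the scores
# comes from the same eigenvalues as classical PCA, so the ratios should match
scores = KernelPCA(kernel='linear', n_components=X_sample.shape[1]).fit_transform(X_sample)
kpca_ratio = np.var(scores, axis=0) / np.var(scores, axis=0).sum()

print(np.allclose(pca_ratio, kpca_ratio))  # expected: True (up to numerical precision)

With an RBF kernel the same computation instead measures variance in the RBF feature space, so the resulting curve is not comparable to the linear PCA scree plot.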
In short, the eigenvalues in kPCA are not supposed to be interpreted like those of classical PCA, except for a linear kernel function. The optimization problem solved by kPCA is the dual of the classical PCA problem in the feature space, not in the input space.
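As a practical aside, if you still want a scree-type plot for the RBF kernel, newer scikit-learn releases expose the kernel eigenvalues through the eigenvalues_ attribute (named lambdas_ before version 1.0). Just keep in mind that the resulting ratios describe variance in the RBF feature space, not in the 78-dimensional input space. A rough sketch, again using random stand-in data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_sub = rng.standard_normal((1000, 78))  # stand-in for scaled_merged.iloc[0:1000, :]

# With n_components=None (the default) all non-zero components are kept, so the
# eigenvalues of the centered kernel matrix sum to the total feature-space variance
kpca = KernelPCA(kernel='rbf', gamma=10).fit(X_sub)
ratios = kpca.eigenvalues_ / kpca.eigenvalues_.sum()  # use lambdas_ on scikit-learn < 1.0

plt.figure()
plt.plot(np.cumsum(ratios))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Share of Feature-Space Variance')
plt.title('Kernel PCA (RBF) Scree Plot')
plt.show()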
Reference:
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.