Principal component analysis in Python
I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?
I don't just want to use singular value decomposition (SVD) because my input data are quite high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.
I was hoping to find a premade, debugged implementation that already makes the right decisions for when to use which method, and which maybe does other optimizations that I don't know about.
Months later, here's a small PCA class, and a picture:
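The original class and picture aren't reproduced here; below is a minimal sketch of such a class (my own reconstruction, not the original author's code), built on numpy.linalg.svd with samples as rows:

```python
import numpy as np

class PCA:
    """Minimal PCA via SVD of the centered data matrix (rows = samples)."""

    def __init__(self, data, n_components=None):
        self.mean = data.mean(axis=0)
        centered = data - self.mean
        # Rows of Vt are the principal axes; s holds the singular values.
        _, s, Vt = np.linalg.svd(centered, full_matrices=False)
        self.components = Vt[:n_components]
        # Variance explained by each retained component.
        self.explained_variance = (s ** 2 / (len(data) - 1))[:n_components]

    def transform(self, data):
        return (data - self.mean) @ self.components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
pca = PCA(X, n_components=2)
Y = pca.transform(X)          # shape (100, 2)
```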
PCA using numpy.linalg.svd is super easy. Here's a simple demo:
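The demo itself was lost in extraction; a sketch of what it might look like (my own, assuming samples as rows):

```python
import numpy as np

# Toy data: 10 samples, 4 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))

Xc = X - X.mean(axis=0)                  # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

projected = Xc @ Vt[:2].T                # keep the top 2 components
# Equivalently: the scaled left singular vectors.
assert np.allclose(projected, U[:, :2] * s[:2])
```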
You can use sklearn:
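For example, with sklearn.decomposition.PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))

pca = PCA(n_components=3)
reduced = pca.fit_transform(X)           # shape (50, 3)
print(pca.explained_variance_ratio_)     # fraction of variance per component
```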
matplotlib.mlab has a PCA implementation. (Note that it was later deprecated and removed in newer Matplotlib releases.)
You might have a look at MDP.
I have not had the chance to test it myself, but I've bookmarked it exactly for the PCA functionality.
SVD should work fine with 460 dimensions. It takes about 7 seconds on my Atom netbook. The eig() method takes more time (as it should; it uses more floating-point operations) and will almost always be less accurate.
If you have fewer than 460 examples, then what you want to do is diagonalize the scatter matrix (x - datamean)^T (x - datamean), assuming your data points are columns, and then left-multiply by (x - datamean). That might be faster in the case where you have more dimensions than data.
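A sketch of that trick (my own code; data points are columns, as in the answer): the small n×n matrix is diagonalized, and its eigenvectors are lifted back to the full d-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 460, 100                              # more dimensions than examples
x = rng.normal(size=(d, n))                  # data points are columns
xc = x - x.mean(axis=1, keepdims=True)

# Diagonalize the small n x n matrix instead of the d x d scatter matrix.
small = xc.T @ xc                            # (x - datamean)^T (x - datamean)
evals, evecs = np.linalg.eigh(small)
evals, evecs = evals[::-1], evecs[:, ::-1]   # eigh is ascending; sort descending

# Left-multiply to recover eigenvectors of the d x d scatter matrix.
u = xc @ evecs
u /= np.linalg.norm(u, axis=0)               # normalize each column

# Check: columns of u are eigenvectors of xc @ xc.T with the same eigenvalues.
big = xc @ xc.T
assert np.allclose(big @ u[:, 0], evals[0] * u[:, 0])
```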
You can quite easily "roll" your own using scipy.linalg (assuming a pre-centered dataset data): then evs are your eigenvalues, and evmat is your projection matrix. If you want to keep d dimensions, use the first d eigenvalues and first d eigenvectors. Given that scipy.linalg has the decomposition and numpy the matrix multiplications, what else do you need?
I just finished reading the book Machine Learning: An Algorithmic Perspective. All code examples in the book are written in Python (mostly with NumPy). The code snippet in chapter 10.2, Principal Components Analysis, may be worth reading. It uses numpy.linalg.eig.
By the way, I think SVD can handle 460 * 460 dimensions very well. I have computed a 6500*6500 SVD with numpy/scipy.linalg.svd on a very old PC: a Pentium III at 733 MHz. To be honest, the script needed a lot of memory (about 1.x GB) and a lot of time (about 30 minutes) to get the SVD result.
But I think 460*460 on a modern PC will not be a big problem, unless you need to do SVD a huge number of times.
You do not need a full Singular Value Decomposition (SVD), as it computes all eigenvalues and eigenvectors and can be prohibitive for large matrices.
scipy and its sparse module provide generic linear algebra functions working on both sparse and dense matrices, among which there is the eig* family of functions:
http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations
Scikit-learn provides a Python PCA implementation, which only supports dense matrices for now.
Timings:
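As a sketch of the truncated route mentioned above (my own code), scipy.sparse.linalg.eigsh computes only the k largest eigenpairs of the covariance matrix, and also accepts dense arrays:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 460))
Xc = X - X.mean(axis=0)

# Only the k largest-magnitude eigenpairs are computed, not all 460.
k = 10
evals, evecs = eigsh(np.cov(Xc.T), k=k, which='LM')
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

reduced = Xc @ evecs                     # shape (300, 10)
```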
Here is another implementation of a PCA module for Python using numpy, scipy, and C extensions. The module carries out PCA using either SVD or the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm, which is implemented in C.
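The module itself isn't shown here, but the NIPALS iteration is short; a pure-NumPy sketch (my own, not the module's C code) that extracts components one at a time by deflation:

```python
import numpy as np

def nipals_pca(X, n_components, n_iter=500, tol=1e-9):
    """NIPALS PCA: extract components one at a time, deflating after each."""
    X = X - X.mean(axis=0)
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, 0].copy()                  # initial score vector
        for _ in range(n_iter):
            p = X.T @ t / (t @ t)           # loading for current scores
            p /= np.linalg.norm(p)
            t_new = X @ p                   # updated score vector
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        scores.append(t)
        loadings.append(p)
        X = X - np.outer(t, p)              # deflate: remove this component
    return np.array(scores).T, np.array(loadings).T

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 12))
T, P = nipals_pca(X, n_components=2)        # scores (60, 2), loadings (12, 2)
```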
If you're working with 3D vectors, you can apply SVD concisely using the toolbelt vg. It's a light layer on top of numpy.
There's also a convenient alias if you only want the first principal component:
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are verbose or opaque in NumPy.