PCA on a large matrix in Java
I have a very large matrix (about 500,000 × 20,000) of data that I want to analyze with PCA. I'm using the ParallelColt library and have tried both singular value decomposition and eigenvalue decomposition to get the eigenvectors and eigenvalues of the covariance matrix. Both approaches exhaust the heap, and I get "OutOfMemory" errors...
The errors remain even when I use SparseDoubleMatrix2D (the data are very sparse). How can I solve this problem?
Should I switch to a different library?
Comments (3)
You can compute PCA with Oja's rule: it's an iterative algorithm that improves an estimate of the principal components one vector at a time. It's slower than the usual PCA, but it only requires you to store one vector in memory. It's also numerically very stable.
http://en.wikipedia.org/wiki/Oja%27s_rule
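To make the update concrete, here is a minimal Java sketch of Oja's rule for the first principal component only. It is an illustration under stated assumptions, not code from ParallelColt or any other library: the class and method names (OjaSketch, update, firstComponent) are made up, the rows are assumed to be mean-centered already, the learning rate eta is a fixed value you would have to tune, and the in-memory double[][] stands in for whatever source streams the rows from disk. Further components would need deflation or a generalization such as Sanger's rule, which is not shown.

```java
import java.util.Random;

// Hypothetical sketch: estimates only the first principal component.
final class OjaSketch {

    /** One Oja update for a single row x: w += eta * y * (x - y * w), with y = w . x */
    static void update(double[] w, double[] x, double eta) {
        double y = 0.0;
        for (int i = 0; i < w.length; i++) {
            y += w[i] * x[i];
        }
        for (int i = 0; i < w.length; i++) {
            w[i] += eta * y * (x[i] - y * w[i]);
        }
    }

    /**
     * Runs several passes over the rows and returns a unit-length estimate of
     * the first principal direction. Assumes the rows are already mean-centered.
     */
    static double[] firstComponent(double[][] rows, int epochs, double eta, long seed) {
        int d = rows[0].length;
        Random rnd = new Random(seed);
        double[] w = new double[d];          // the only state kept between rows
        for (int i = 0; i < d; i++) {
            w[i] = rnd.nextGaussian();       // random starting direction
        }
        normalize(w);
        for (int e = 0; e < epochs; e++) {
            for (double[] x : rows) {        // in practice: stream rows from disk
                update(w, x, eta);
            }
        }
        normalize(w);
        return w;
    }

    /** Scales w to unit Euclidean length. */
    static void normalize(double[] w) {
        double norm = 0.0;
        for (double v : w) {
            norm += v * v;
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < w.length; i++) {
            w[i] /= norm;
        }
    }
}
```

With 20,000 columns the weight vector w is only about 160 KB, so the memory cost is dominated by however many rows you choose to buffer at once.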
I'm not sure that changing libraries will help. You're going to need doubles (8 bytes each). I don't know what the dimension of the covariance matrix would be in this case, but switching libraries won't change the underlying calculations much.
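A quick back-of-the-envelope calculation shows why dense storage cannot work here. The sketch below assumes the 500,000 rows are observations and the 20,000 columns are variables, so the covariance matrix would be 20,000 × 20,000; that orientation is an assumption, since the question does not say. The class and method names are made up for illustration, and the figures ignore object headers and any working copies a decomposition allocates.

```java
// Hypothetical helper: raw storage needed for dense double matrices.
final class MatrixMemory {

    /** Bytes needed to hold rows * cols doubles (8 bytes each), overflow-checked. */
    static long denseBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), 8L);
    }

    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long data = denseBytes(500_000, 20_000);       // 80,000,000,000 bytes
        long covariance = denseBytes(20_000, 20_000);  // 3,200,000,000 bytes
        System.out.printf("data matrix:       %.2f GiB%n", toGiB(data));
        System.out.printf("covariance matrix: %.2f GiB%n", toGiB(covariance));
    }
}
```

That is roughly 74.5 GiB for the dense data matrix and about 2.98 GiB for a dense covariance matrix, before counting any temporaries. Note that the covariance matrix of sparse data is generally not sparse once the means are subtracted, which may be why SparseDoubleMatrix2D did not help.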
What is your -Xmx setting when you run? What about the perm gen size? Perhaps you can increase them.
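If you are unsure what heap limit the JVM actually ended up with, you can ask it at runtime. The sketch below uses the standard Runtime.maxMemory() call; the class name HeapCheck and the 20,000 × 20,000 covariance size are assumptions for illustration. Keep in mind that the permanent generation holds class metadata rather than array data, so for large matrices the setting that matters is -Xmx.

```java
// Hypothetical helper: reports the heap limit the running JVM was given.
final class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** True if a single allocation of this size could even theoretically fit. */
    static boolean fits(long requiredBytes) {
        return requiredBytes < maxHeapBytes();
    }

    public static void main(String[] args) {
        long covarianceBytes = 20_000L * 20_000L * 8L; // assumed dense covariance size
        System.out.printf("max heap: %.2f GiB%n",
                maxHeapBytes() / (1024.0 * 1024.0 * 1024.0));
        System.out.println("dense 20000 x 20000 covariance fits: " + fits(covarianceBytes));
    }
}
```

Run it with the same options as the real job, for example `java -Xmx8g HeapCheck`, to confirm that the setting is being picked up.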
Does the algorithm halt immediately or does it run for a while? If it's the latter, you can attach to the process using Visual VM 1.3.3 (download and install all the plugins). It will let you see what's happening on the heap, in the threads, and so on, which could help you ferret out the root cause.
A Google search for "Java eigenvalue of large matrices" turned up this library from Google. If you scroll down in the comments, I wonder if a block Lanczos eigenvalue analysis might help. It might be enough if you can get a subset of the eigenvalues.
These SVM implementations claim to be useful for large datasets:
http://www.support-vector-machines.org/SVM_soft.html
I don't think you can ask for more than about 2GB of heap on a 32-bit JVM:
http://www.theserverside.com/discussions/thread.tss?thread_id=26347
According to Oracle, you'll need a 64-bit JVM running on a 64-bit OS to get a larger heap:
http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#gc_heap_32bit
I built some sparse, incremental algorithms for just this sort of problem. Conveniently, they're built on top of Colt.
See the HallMarshalMartin class in the trickl-cluster library linked below. You can feed it chunks of rows at a time, so it should solve your memory issues.
The code is available under the GPL. I'm afraid I've only just released it, so it's short on documentation, but hopefully it's fairly self-explanatory. There are JUnit tests that should help with usage.
http://open.trickl.com/trickl-pca/index.html
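To illustrate the general idea of feeding rows in chunks, here is a minimal Java sketch that accumulates column sums and cross-products chunk by chunk and builds the covariance matrix at the end. This is not the HallMarshalMartin algorithm and not the trickl API; the class ChunkedCovariance and its methods are hypothetical. It stores the covariance densely, so for 20,000 columns it still needs several GB, and the one-pass formula can lose precision when the column means are large compared with the spread.

```java
// Hypothetical sketch: covariance accumulated from row chunks, never holding all rows.
final class ChunkedCovariance {
    private final int d;
    private final double[] sum;      // running column sums
    private final double[][] cross;  // running sums of x[i] * x[j]
    private long n;                  // rows seen so far

    ChunkedCovariance(int d) {
        this.d = d;
        this.sum = new double[d];
        this.cross = new double[d][d];
    }

    /** Folds one chunk of rows into the running totals; the chunk can then be discarded. */
    void addChunk(double[][] rows) {
        for (double[] x : rows) {
            for (int i = 0; i < d; i++) {
                if (x[i] == 0.0) {
                    continue;        // zero entries add nothing, which helps on sparse rows
                }
                sum[i] += x[i];
                for (int j = 0; j < d; j++) {
                    cross[i][j] += x[i] * x[j];
                }
            }
            n++;
        }
    }

    /** Sample covariance: (sum of x_i x_j - sum_i * sum_j / n) / (n - 1). */
    double[][] covariance() {
        if (n < 2) {
            throw new IllegalStateException("need at least two rows");
        }
        double[][] c = new double[d][d];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                c[i][j] = (cross[i][j] - sum[i] * sum[j] / n) / (n - 1);
            }
        }
        return c;
    }
}
```

The resulting matrix can then be handed to whatever eigenvalue routine you already use, since it depends only on the number of columns and not on the number of rows.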