k-means clustering on a very large sparse matrix in R?
I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500,000 rows by 4,000 columns, yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
Comments (4)
The bigmemory package (now a family of packages; see their website) used k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function, bigkmeans.
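As a rough illustration of the suggestion above, here is a minimal sketch that runs bigkmeans on a small toy matrix. It assumes the bigmemory and biganalytics packages are installed; the toy data, the seed, and the choice of three clusters are made up for the example. Note that a big.matrix is stored densely, so the full 500,000 x 4,000 matrix would need about 16 GB as doubles and would probably have to be file-backed (for example via read.big.matrix with a backingfile) rather than held in RAM.

```r
library(bigmemory)
library(biganalytics)

set.seed(42)

# Toy stand-in for the real data (an assumption for this sketch):
# 300 rows x 5 columns, drawn from three shifted groups.
x <- rbind(
  matrix(rnorm(500, mean = 0),  ncol = 5),
  matrix(rnorm(500, mean = 10), ncol = 5),
  matrix(rnorm(500, mean = 20), ncol = 5)
)

# Convert the ordinary matrix to a big.matrix (in-memory here).
bm <- as.big.matrix(x)

# k-means on the big.matrix; the result has the same layout as stats::kmeans.
fit <- bigkmeans(bm, centers = 3, iter.max = 20, nstart = 5)

print(fit$size)
```

The returned object can be inspected like a regular kmeans result (fit$cluster, fit$centers, fit$size).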
Please check:
Cheers.
sparcl performs sparse hierarchical clustering and sparse k-means clustering.
This should work well for matrices that R can handle, that is, ones that fit into memory.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
==
For really big matrices, I would try a solution based on Apache Spark sparse matrices and MLlib, although I do not know how experimental it is now:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
There's a special SparseM package for R that can hold it efficiently. If that doesn't work, I would try going to a higher performance language, like C.
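Holding the data in a sparse structure is only half of the problem, because the standard kmeans function converts its input to a dense matrix. The sketch below shows one way to run Lloyd's k-means directly on a sparse matrix. It swaps in the Matrix package (which ships with R) instead of SparseM, and sparse_kmeans, its init argument, and the toy data are all hypothetical names and values introduced for illustration, not part of any package.

```r
library(Matrix)

# Hypothetical helper: Lloyd's k-means that never densifies X.
# X    : sparse numeric matrix (rows = observations)
# k    : number of clusters
# init : row indices used as starting centers (random by default)
sparse_kmeans <- function(X, k, iter_max = 20,
                          init = sample.int(nrow(X), k)) {
  n <- nrow(X)
  centers <- as.matrix(X[init, , drop = FALSE])  # k x p, dense (small)
  row_sq  <- rowSums(X^2)                        # squared norm of each row
  cluster <- integer(n)

  for (iter in seq_len(iter_max)) {
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2 (n x k).
    cross <- as.matrix(X %*% t(centers))
    d2 <- outer(row_sq, rowSums(centers^2), "+") - 2 * cross

    # Assign each row to its nearest center (first one wins on ties).
    new_cluster <- max.col(-d2, ties.method = "first")
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster

    # k x n indicator matrix; multiplying by X gives per-cluster sums.
    ind <- sparseMatrix(i = cluster, j = seq_len(n),
                        x = rep(1, n), dims = c(k, n))
    sizes <- rowSums(ind)
    sums  <- as.matrix(ind %*% X)

    # Update only non-empty clusters; empty ones keep their old center.
    keep <- sizes > 0
    centers[keep, ] <- sums[keep, , drop = FALSE] / sizes[keep]
  }
  list(cluster = cluster, centers = centers)
}

# Toy data (made up): 100 rows x 10 columns, two "1" values per row.
# Rows 1-50 use columns 1-5, rows 51-100 use columns 6-10.
r <- seq_len(100)
offset <- ifelse(r <= 50, 0, 5)
X <- sparseMatrix(
  i = c(r, r),
  j = c(1 + offset, 2 + (r - 1) %% 4 + offset),
  x = rep(1, 200),
  dims = c(100, 10)
)

# Start from one row of each group so the demo is deterministic.
fit <- sparse_kmeans(X, k = 2, init = c(1, 51))
print(table(fit$cluster))
```

The only dense objects are the k x p center matrix and the n x k distance matrix, so memory use grows with the number of clusters rather than with the number of columns. Like any Lloyd-style k-means, the result depends on the starting centers, so several random restarts would be advisable on real data.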