K-means with a very large matrix

Posted on 2024-11-15 18:14:57

I have to perform k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know whether I can use R or Weka to do this.
My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space.

I have enough disk space for the calculations, but loading such a matrix seems to be a problem in R. I don't think the bigmemory package would help: a big matrix automatically uses all my RAM, and then my swap file if that is not enough.

So my question is: what software should I use (possibly in combination with other packages or custom settings)?

Thanks for helping me.

Note: I use Linux.

Comments (4)

梦纸 2024-11-22 18:14:57

Does it have to be k-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes; your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step, and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to perform this computation efficiently on large-scale data (it can utilise multiple CPUs and multiple machines).

There is still an issue with the data size, as most clustering algorithms will require you to perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?
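To make the transformation step concrete, here is a minimal Python sketch of a blockwise Pearson correlation with a cutoff. It is not the software shipped with MCL; the function names, the `threshold` of 0.7 and the `block` size are assumptions chosen for illustration. Rows are centered and scaled to unit length so that a plain dot product gives the correlation, and only pairs above the cutoff are kept as network edges. Note that 300,000 rows still give roughly 4.5 x 10^10 row pairs, which is why this step is likely to dominate the running time.

```python
import numpy as np

def standardize_rows(X):
    """Center each row and scale it to unit length, so that the dot
    product of two rows equals their Pearson correlation."""
    X = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows: correlation is reported as 0
    return X / norms

def correlation_edges(X, threshold=0.7, block=1000):
    """Yield (i, j, r) for row pairs i < j whose Pearson r >= threshold.

    Only two blocks of rows are held in memory at a time, so X can be
    an on-disk array such as a np.memmap.
    """
    n = X.shape[0]
    for a in range(0, n, block):
        Za = standardize_rows(np.asarray(X[a:a + block], dtype=np.float64))
        for b in range(a, n, block):
            Zb = standardize_rows(np.asarray(X[b:b + block], dtype=np.float64))
            R = Za @ Zb.T  # correlations between the two blocks
            rows, cols = np.nonzero(R >= threshold)
            for i, j in zip(rows, cols):
                gi, gj = a + i, b + j
                if gi < gj:  # skip self-pairs and mirrored duplicates
                    yield int(gi), int(gj), float(R[i, j])
```

The yielded triples can be written out as a weighted edge list for a graph clustering tool. Raising the threshold prunes more edges and shrinks the network, at the cost of possibly disconnecting weakly related rows.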

花开雨落又逢春i 2024-11-22 18:14:57

I am keeping the link (it may be useful to this specific user), but I agree with Gavin's comment!
To perform k-means clustering on big data you can use the rxKmeans function implemented in Revolution R Enterprise, a proprietary implementation of R (I know this can be a problem); this function seems to be capable of handling that kind of data.
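For readers without access to a proprietary tool, the general idea of clustering data that does not fit in RAM can be sketched in a few lines of Python. This is not how rxKmeans is implemented; it is a plain Lloyd's k-means that reads the matrix in row chunks, assuming the data is stored as a raw row-major binary file that can be opened with `np.memmap`. The function names, the chunk size and the file name in the usage comment are hypothetical. Each iteration makes one sequential pass over the file and keeps only the centroids plus one chunk in memory.

```python
import numpy as np

def nearest_centroid(chunk, centroids):
    """Index of the closest centroid for every row of chunk."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; ||x||^2 is constant per
    # row, so it can be dropped when only the argmin is needed.
    d2 = -2.0 * chunk @ centroids.T + (centroids ** 2).sum(axis=1)
    return d2.argmin(axis=1)

def kmeans_out_of_core(X, k, n_iter=10, chunk_rows=1000, seed=0):
    """Lloyd's k-means reading X chunk by chunk; X may be a np.memmap."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Start from k distinct random rows (sorted for sequential reads).
    init = np.sort(rng.choice(n, size=k, replace=False))
    centroids = np.asarray(X[init], dtype=np.float64)
    for _ in range(n_iter):
        sums = np.zeros((k, d))
        counts = np.zeros(k, dtype=np.int64)
        for start in range(0, n, chunk_rows):
            chunk = np.asarray(X[start:start + chunk_rows], dtype=np.float64)
            labels = nearest_centroid(chunk, centroids)
            np.add.at(sums, labels, chunk)  # accumulate per-cluster sums
            counts += np.bincount(labels, minlength=k)
        nonempty = counts > 0  # empty clusters keep their old centroid
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids

def assign_labels(X, centroids, chunk_rows=1000):
    """Final cluster label for every row, again computed chunk by chunk."""
    parts = [nearest_centroid(np.asarray(X[s:s + chunk_rows], dtype=np.float64),
                              centroids)
             for s in range(0, X.shape[0], chunk_rows)]
    return np.concatenate(parts)

# Hypothetical usage, assuming a float32 file with the question's shape:
# X = np.memmap("matrix.f32", dtype=np.float32, mode="r",
#               shape=(300_000, 100_000))
# centroids = kmeans_out_of_core(X, k=20, chunk_rows=2000)
```

With 100,000 columns, a chunk of 2,000 rows converted to float64 takes about 1.6 GB, so `chunk_rows` has to be chosen to fit the 8 GB of RAM mentioned in the question. The running time is dominated by reading the full file once per iteration.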

如此安好 2024-11-22 18:14:57

Since we know nothing at all about the data, nor the questioner's goals for it, here are just a couple of general links:
I. Guyon's video lectures (many papers and books too).
feature selection on stats.stackexchange

妞丶爷亲个 2024-11-22 18:14:57

Check out Mahout; it will do k-means on a large data set:

http://mahout.apache.org/
