Hierarchical clustering of 1 million objects
Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects; Orange was able to cluster 18k objects in seconds, but failed with 100k objects (it saturated memory and eventually crashed).
I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB swap on Ubuntu 11.10.
2 Answers
The problem probably is that they try to compute the full 2D distance matrix (naively about 8 TB with double precision for 1 million objects; even 100k objects already need 80 GB) and then their algorithm runs in O(n^3) time anyway.
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are usually not at all convincing, in particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue with hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be approximately 1 million times as fast as the naive O(n^3) algorithms. I don't know whether they already have CLINK, too, and I'm not sure there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
Consider using other algorithms. k-means, for example, scales very well with the number of objects (though it usually isn't very good either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low-dimensional, they can be accelerated quite well with an appropriate index structure: with an index offering O(log n) query time, they should run in O(n log n), which can make a huge difference for large data sets. I've personally used OPTICS on a 110k-image data set without problems, so I can imagine it scales up well to 1 million objects on your system.
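For illustration, here is a minimal sketch of the index-accelerated route using scikit-learn's OPTICS (available in newer scikit-learn releases, not at the time of this question); the random data, min_samples, and max_eps values are placeholder assumptions to adapt to your own features:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Placeholder data: 100k points in 10 dimensions; substitute your own features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))

# A finite max_eps bounds every neighborhood query, so no n x n distance
# matrix is ever materialized; for low-dimensional data the neighbor
# searches go through a tree index rather than brute force.
opt = OPTICS(min_samples=10, max_eps=2.0).fit(X)

labels = opt.labels_  # -1 marks noise points
print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")
```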
To beat O(n^2), you'll first have to reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:
build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth;
first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres (see the sketch after this list).
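A minimal sketch of the second approach, assuming scikit-learn and SciPy; the 1000 centres, the Ward linkage, and the 50-cluster cut are placeholder choices, not part of the original suggestion:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder data: 1M points in 50 dimensions; substitute your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 50))

# Stage 1: reduce 1M points to 1000 flat clusters.
# MiniBatchKMeans scales roughly linearly in the number of points.
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, random_state=0)
coarse_labels = kmeans.fit_predict(X)

# Stage 2: hierarchical clustering of the 1000 cluster centres only,
# which is cheap (a 1000 x 1000 problem instead of 1M x 1M).
Z = linkage(kmeans.cluster_centers_, method="ward")

# Cut the dendrogram into, say, 50 top-level clusters and map each
# original point through its coarse cluster's assignment.
centre_labels = fcluster(Z, t=50, criterion="maxclust")
point_labels = centre_labels[coarse_labels]
```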
How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves? What software are you using, and how many hours / days do you have for the clustering?
For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- not your case.
I know hardly anything about clustering text; locality-sensitive hashing?
Take a look at scikit-learn clustering -- it has several methods, including DBSCAN; a tiny text-clustering sketch follows below.
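As a rough, hedged sketch of what that could look like for text, with made-up documents and untuned eps/min_samples:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical document collection; substitute your own corpus.
docs = ["first document ...", "second document ...", "third one ..."]

# Sparse tf-idf vectors keep memory proportional to the nonzeros.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# DBSCAN accepts sparse input; with metric="cosine" it falls back to
# brute-force neighbor search, so eps/min_samples still need tuning.
labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(X)
```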
Added: see also google-all-pairs-similarity-search, "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007, and the SO question hierarchical-clusterization-heuristics.