每个簇大小具有上限要求的聚类算法
我需要将大约 50000 个点划分为不同的簇。有一个要求:每个簇的大小不能超过K。有没有任何聚类算法可以完成这项工作?
请注意,每个簇的上限 K 都是相同的,比如 100。
I need to do a partition of approximately 50000 points into distinct clusters. There is one requirement: the size of every cluster cannot exceed K. Is there any clustering algorithm that can do this job?
Please note that upper bound, K, of every cluster is the same, say 100.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
大多数聚类算法可用于创建树,其中最低级别只是单个元素 - 要么是因为它们自然地通过连接成对元素然后连接元素组来“自下而上”地工作,要么因为 - 就像 K 均值一样,它们可用于反复将组分成更小的组。
一旦你有了一棵树,你就可以决定在哪里分割子树以形成大小 <= 100 的簇。修剪现有的树通常非常容易。假设您想要划分现有树以最小化您创建的集群的一些成本总和。您可能有:
Most clustering algorithms can be used to create a tree in which the lowest level is just a single element - either because they naturally work "bottom up" by joining pairs of elements and then groups of joined elements, or because - like K-Means, they can be used to repeatedly split groups into smaller groups.
Once you have a tree, you can decide where to split off subtrees to form your clusters of size <= 100. Pruning an existing tree is often quite easy. Suppose that you want to divide an existing tree to minimise the sum of some cost of the clusters you create. You might have:
一种方法是使用分层 K-means,但是你不断地分裂每个大于K的簇,直到它们都变小。
另一种(在某种意义上相反的方法)是使用分层凝聚聚类 ,即一种自下而上的方法,并再次确保如果它们将形成一个大小为 > 的新簇,则不要合并簇。 K.
One way is to use hierarchical K-means, but you keep splitting each cluster which is larger than K, until all of them are smaller.
Another (in some sense opposite approach) would be to use hierarchical agglomerative clustering, i.e. a bottom up approach and again make sure you don't merge cluster if they'll form a new one of size > K.
朴素聚类的问题在于,您确实必须计算一个距离矩阵,该距离矩阵保存 A 与集合中每个其他成员的距离。这取决于您是否对总体进行了预处理,或者是否将聚类合并为典型个体,然后再次重新计算距离矩阵。
The issue with naive clustering is that you do indeed have to calculate a distance matrix that holds the distance of A from every other member in the set. It depends whether you've pre-processed the population or your amalgamating the clusters into typical individuals then recalculating the distance matrix again.