What method do you use to select the optimal number of clusters in k-means and EM?
Many clustering algorithms are available. A popular one is k-means, which takes a given number of clusters and iterates to find the best assignment of objects to those clusters.
What method do you use to determine the number of clusters in the data in k-means clustering?
Does any package available in R provide the V-fold cross-validation method for determining the right number of clusters?
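To make the question concrete, here is a minimal sketch of one common heuristic for picking k: run k-means for several candidate values and keep the one with the highest mean silhouette width. This is not the V-fold cross-validation method asked about, and it is written in plain Python rather than R purely for illustration. The function names (`kmeans`, `silhouette`, `choose_k`, `make_blobs`), the farthest-point initialisation, and the synthetic data are all assumptions of this sketch.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns one cluster label per point.

    The deterministic farthest-point start is an assumption of this
    sketch, chosen so that the demo is reproducible.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width; values near 1 mean compact, well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, candidates):
    """Score every candidate k and return (best k, all scores)."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


def make_blobs(seed=1):
    """Hypothetical test data: three well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in blob_centers for _ in range(30)]


points = make_blobs()
best_k, scores = choose_k(points, range(2, 7))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

Because the three blobs are far apart relative to their spread, the score is expected to peak at k = 3 here. On real data the peak is often much less pronounced, so the silhouette is best treated as one piece of evidence, not a definitive answer.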
Another widely used approach is the Expectation-Maximization (EM) algorithm, which assigns each instance a probability distribution over the clusters, giving the probability that it belongs to each one.
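As an illustration of what "a probability for each cluster" means, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture in plain Python. It is an assumption-laden toy and not any particular package's implementation: the function names, the quantile-based initialisation, the fixed iteration count, and the synthetic data are all hypothetical choices made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def e_step(xs, weights, means, variances):
    """Return resp[i][j]: the probability that xs[i] belongs to component j."""
    resp = []
    for x in xs:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    return resp


def em_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture with a fixed number of EM rounds."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialisation: spread the starting means over the data quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp = e_step(xs, weights, means, variances)
        # M-step: re-estimate each component from its soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    # Final E-step so the returned probabilities match the returned parameters.
    return weights, means, variances, e_step(xs, weights, means, variances)


# Hypothetical data: two groups centred at 0 and 8.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(8.0, 1.0) for _ in range(100)]
weights, means, variances, resp = em_gmm_1d(data, k=2)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
print([round(p, 3) for p in resp[0]])  # soft membership of the first point
```

Unlike k-means, each row of `resp` is a full distribution over the components, so a point lying between two groups gets split membership instead of a hard label.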
Is this algorithm implemented in R?
If so, does it have an option to select the optimal number of clusters automatically by cross-validation?
Or do you prefer some other clustering method?
Comments (2)
For large, sparse datasets I would seriously recommend the affinity propagation method.
It performs better than k-means and is deterministic in nature.
http://www.psi.toronto.edu/affinitypropagation/
It was published in the journal Science.
However, the best choice of clustering algorithm depends on the dataset under consideration. K-means is a textbook method, and it is very likely that someone has developed a better algorithm for your type of dataset.
Here is a good tutorial by Prof. Andrew Moore (CMU, Google) on k-means and hierarchical clustering.
http://www.autonlab.org/tutorials/kmeans.html
Last week I coded up an algorithm of this kind, one that estimates the number of clusters, for a k-means clustering program. I used the method outlined in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.9687&rep=rep1&type=pdf
My biggest implementation problem was finding a suitable cluster validation index (i.e. an error metric) that would work. Processing speed is now the remaining issue, but the results currently look reasonable.
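The answer does not say which validation index was eventually chosen. As one illustrative example only, here is a minimal sketch of the Davies-Bouldin index in plain Python: for each cluster it takes the worst ratio of combined within-cluster scatter to between-centroid distance, then averages over clusters, so lower values indicate a better partition. The function name and the tiny dataset are assumptions made for this example.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled partition (lower is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    # Centroid of each cluster, coordinate by coordinate.
    centroids = {l: tuple(sum(col) / len(ms) for col in zip(*ms))
                 for l, ms in clusters.items()}
    # Scatter: mean distance from the members to their own centroid.
    scatter = {l: sum(math.dist(p, centroids[l]) for p in ms) / len(ms)
               for l, ms in clusters.items()}
    # For each cluster, find the most similar (worst-separated) other cluster.
    # Note: this divides by zero if two centroids coincide exactly.
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


# Hypothetical data: two tight pairs of points, 10 units apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # groups the nearby points together
bad = [0, 1, 0, 1]   # pairs each point with a distant one
print(davies_bouldin(points, good), davies_bouldin(points, bad))
```

To estimate the number of clusters with an index like this, one would compute it on the k-means result for each candidate k and prefer the k with the lowest value. Since it only needs centroids and distances to them, it is also cheaper than pairwise indices, which matters when processing speed is a concern.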