使用带有 L 方法的平滑器来确定 K 均值聚类的数量
在应用 L 方法来确定数据集中 k 均值簇的数量之前,是否有人尝试过对评估指标应用平滑器?如果是这样,它是否改善了结果?或者允许更少数量的 k 均值试验,从而大幅提高速度?您使用哪种平滑算法/方法?
“L-方法”详细说明如下: 确定层次聚类/分割算法中的聚类/分段数量 ,萨尔瓦多和Chan
这计算一系列不同试验簇计数的评估指标。然后,为了找到拐点(出现最佳数量的簇),使用线性回归拟合两条线。应用一个简单的迭代过程来改善膝部拟合 - 这使用现有的评估指标计算,并且不需要重新运行 k 均值。
对于评估指标,我使用简化版邓斯指数的倒数。简化速度(基本上我的直径和簇间计算被简化)。倒数使指数朝着正确的方向运行(即通常越低越好)。
K 均值是一种随机算法,因此通常会运行多次并选择最佳拟合。这工作得很好,但是当你对 1..N 个集群执行此操作时,时间很快就会增加。因此,控制运行次数符合我的利益。总体处理时间可能决定我的实现是否实用 - 如果我无法加快速度,我可能会放弃此功能。
Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using a reciprocal of a simplified version of the Dunns Index. Simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (ie. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我过去曾在这里问过一个类似问题就这样。我的问题是想出一个一致的方法来找到你所描述的 L 形的膝盖。所讨论的曲线代表了模型的复杂性和拟合度之间的权衡。
最佳解决方案是找到最大距离
d
的点如图所示:注意:我还没有尚未阅读您链接到的论文..
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance
d
according to the figure shown:Note: I haven't read the paper you linked to yet..