Finding the optimal number of clusters for K-means clustering
I have a text file containing a long 2D array. The first element of each row is a number between 1 and 6.
I was able to cluster the data using the guidance provided in the following post:
But I am wondering how I can make the clustering choose the number of clusters ("n_clusters") by itself, without me picking that value.
I tried the elbow method, but the examples I have seen so far use a plot to choose the optimal number of clusters. My question is: how do I find the optimal value for the number of clusters without a visual check?

The "elbow method" is just a search for the point of diminishing returns.
While hyperparameters can be set visually, it is really just a matter of checking whether the most recent improvement is significant (i.e. whether it crosses some threshold).
Build a table of the k values, the computed error, and the percent improvement from the preceding step:
If your threshold for improvement is 10%, you can stop at k=5. After that, the improvements are diminishing (i.e. tending to overfit and failing to generalize).
In Python, it would look like this:
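A minimal sketch of that threshold check, assuming the clustering error for each k has already been computed (with scikit-learn, for instance, it could be taken from `KMeans(n_clusters=k).fit(X).inertia_`). The function name `choose_k` and the error values below are made up for illustration and are not taken from the original data:

```python
def choose_k(errors_by_k, threshold=0.10):
    """Return the last k whose relative improvement over the previous k
    is at least `threshold` (0.10 means 10%).

    errors_by_k: dict mapping k -> clustering error (e.g. K-means inertia).
    """
    ks = sorted(errors_by_k)
    best = ks[0]
    for prev_k, k in zip(ks, ks[1:]):
        prev_err = errors_by_k[prev_k]
        err = errors_by_k[k]
        if prev_err <= 0:
            # Error is already zero; more clusters cannot improve it.
            break
        improvement = (prev_err - err) / prev_err
        if improvement < threshold:
            # Diminishing returns: keep the previous k.
            break
        best = k
    return best


# Hypothetical errors for k = 1..7 (illustrative numbers only).
example_errors = {
    1: 1000.0,
    2: 600.0,   # 40.0% better than k=1
    3: 400.0,   # 33.3% better than k=2
    4: 300.0,   # 25.0% better than k=3
    5: 250.0,   # 16.7% better than k=4
    6: 240.0,   #  4.0% better than k=5 -> below the 10% threshold
    7: 235.0,
}

print(choose_k(example_errors, threshold=0.10))  # prints 5
```

The loop stops at the first step whose improvement falls below the threshold, so a later, larger drop would be ignored; that is a design choice of this sketch rather than a requirement of the elbow method.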
That outputs: