Determining the best k for k-means clustering
I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
- Assuming that this isn't NP-complete, what is the time complexity of finding a good k? (Probably reported as the number of times the k-means algorithm must be run.)
- Is k-means a good starting point for this type of problem? If so, what other approaches would you recommend? A specific example, backed by anecdote/experience, would be maxi-bon.
- What shortcuts/approximations would you recommend to increase performance?
Answers (8)
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
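A minimal sketch of the agglomerative approach in Python, assuming SciPy is available; Ward linkage and the random placeholder data are my choices, not the answer's:

```python
# Build the merge tree once; cutting it at different depths then yields
# any number of clusters without re-running the algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))           # placeholder 2-D data

tree = linkage(points, method="ward")        # the full grouping sequence

# Instant snapshots for k = 2..5 clusters from the same tree.
for k in (2, 3, 4, 5):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])        # cluster sizes for this k
```

Because the tree is built once, scanning many candidate cluster counts costs almost nothing beyond the initial linkage step.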
In order to use k-means, you should know how many clusters there are. You can't try a naive meta-optimisation, since the more clusters you add (up to one cluster for each data point), the more you head towards over-fitting. You may look for some cluster validation methods and optimise the k hyperparameter with them, but from my experience it rarely works well. It's very costly too.
If I were you, I would do a PCA, possibly on a polynomial feature space (mind your available time), depending on what you know of your input, and then cluster along the most representative components.
More information on your data set would be very helpful for a more precise answer.
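A minimal sketch of that suggestion, assuming scikit-learn; the data, number of components, and number of clusters are all placeholders:

```python
# Project onto the most representative components, then cluster there.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))             # stand-in for your input

reduced = PCA(n_components=2).fit_transform(data)   # most representative axes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
```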
Here's my approximate solution:
The thesis behind this algorithm is that the number of distinct sets of k clusters found is small for "good" values of k; in other words, repeated runs keep producing the same clusters.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
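The answer doesn't pin down how to measure this stability; one possible reading, assuming scikit-learn and using the adjusted Rand index between independent restarts as the agreement measure (my choice, not the answer's):

```python
# For each k, run k-means several times with different seeds and measure
# how often the restarts agree; a "stable" k keeps finding the same clusters.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(points, k, n_runs=10):
    """Mean pairwise agreement between k-means restarts."""
    runs = [KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(points)
            for seed in range(n_runs)]
    return float(np.mean([adjusted_rand_score(a, b)
                          for a, b in combinations(runs, 2)]))

rng = np.random.default_rng(0)
points = rng.normal(size=(150, 2))           # placeholder data

for k in range(2, 7):
    print(k, round(stability(points, k), 3)) # prefer the k with high agreement
```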
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the K-Means algorithm called X-Means which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might want to refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
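As a quick sketch of the Elbow Method the page describes (the code is mine, assuming scikit-learn; the data is a placeholder): track the within-cluster sum of squares against k and pick the k where the curve stops dropping sharply.

```python
# Within-cluster sum of squares (inertia) for a range of candidate k's.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(150, 2))           # placeholder data

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    print(k, round(inertia, 1))              # the "elbow" in this curve is your k
```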
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure, which evaluates how close a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best", or minimally ambiguous, arrangement.
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
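One possible way to operationalise the counting idea above, assuming scikit-learn; treating points with negative per-sample silhouette values as "ambiguously assigned" is my reading, not the paper's exact procedure:

```python
# Count points whose silhouette value is negative (i.e. closer on average
# to another cluster than their own) and prefer the k with the fewest.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))           # placeholder data

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    ambiguous = int((silhouette_samples(points, labels) < 0).sum())
    print(k, ambiguous)                      # fewer ambiguous points is better
```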
From your Wikipedia link:
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point, it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that set to find your k-means.
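A minimal sketch of that shortcut, assuming scikit-learn and uniform random sampling; the sample size and cluster count are arbitrary illustrative values:

```python
# Fit k-means on a small random subset, then assign all points to the
# centroids it learned.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(1_000_000, 2))     # stand-in for a very large set

subset = points[rng.choice(len(points), size=10_000, replace=False)]
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(subset)
labels = model.predict(points)               # cheap assignment for the full set
```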
Choosing the best K can be seen as a Model Selection problem. One possible approach is Minimum Description Length, which in this context means: you could store a table with all the points (in which case K=N). At the other extreme, you have K=1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning and Schutze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
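A hedged sketch of that heuristic as I read the cited section, assuming scikit-learn: take AIC(K) ≈ RSS(K) + 2·M·K, where RSS is the k-means distortion and M the number of dimensions, and pick the K minimising it (treat the exact penalty form as an assumption, not a quote):

```python
# Pick K by minimising an AIC-style criterion: distortion plus a penalty
# that grows with the number of clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))           # placeholder data
M = points.shape[1]                          # dimensionality

def aic(k):
    rss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    return rss + 2 * M * k                   # penalise extra clusters

best_k = min(range(1, 10), key=aic)
print(best_k)
```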
This problem belongs to the "internal evaluation" class of clustering optimisation problems, and the current state-of-the-art solution seems to be the Silhouette coefficient, as stated here
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample usage implementation of the methodology here
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
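In the spirit of that scikit-learn example, a short sketch (the data and the k range are placeholders): compute the mean Silhouette coefficient for each candidate k and keep the highest.

```python
# Mean silhouette coefficient per candidate k; the "natural" number of
# clusters is the k with the highest average.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))           # placeholder data

scores = {}
for k in range(2, 8):                        # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```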