Hierarchical clustering heuristics

Published 2024-11-19 08:55:19


I want to explore the relations between data items in a large array. Every data item is represented by a multidimensional vector. First of all, I decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I can calculate the distance between my vectors, so as a first step I find the minimal spanning tree. After that I need to group the data vectors according to the links in the spanning tree. But at this step I'm unsure how to combine different vectors into hierarchical clusters. I'm using a heuristic: if two vectors are linked and the distance between them is very small, they are in the same cluster; if two vectors are linked but the distance between them is larger than a threshold, they are in different clusters that share a common root cluster.

But maybe there is a better solution?

Thanks

P.S.
Thanks to all!

In fact, I've tried k-means and a variation of CLOPE, but didn't get good results.

So now I know that the clusters in my dataset actually have a complex structure (much more complex than n-spheres).

That's why I want to use hierarchical clustering. Also, I guess that the clusters look like n-dimensional chains (like 2-D or 3-D chains), so I use the single-link strategy.
But I'm unsure how to combine different clusters with each other: in which situations should I create a common root cluster, and in which situations should I merge all sub-clusters into one cluster?
I'm using this simple strategy:

  • If clusters (or vectors) are too close to each other, I merge their contents into one cluster (regulated by a threshold)
  • If clusters (or vectors) are too far from each other, I create a root cluster and put them into it

Using this strategy I get very large cluster trees, and I'm struggling to find a satisfactory threshold. Maybe there is a better strategy for generating the cluster tree?

Here is a simple picture that describes my question:

[image not available]
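The threshold strategy described above can be sketched in plain numpy (a minimal illustration, assuming Euclidean distance; the function names, data, and threshold value are made up for the example): build the minimum spanning tree, drop every edge longer than the threshold, and take the surviving connected components as flat clusters.

```python
import numpy as np

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns (length, i, j) edges of the minimum spanning tree."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()                 # cheapest known edge into each vertex
    parent = np.zeros(n, dtype=int)    # tree endpoint of that edge
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((best[j], parent[j], j))
        in_tree[j] = True
        closer = d[j] < best
        best = np.where(closer, d[j], best)
        parent = np.where(closer, j, parent)
    return edges

def cut_mst(points, threshold):
    """Keep only MST edges <= threshold; the resulting connected
    components are the flat clusters (a single-linkage cut)."""
    n = len(points)
    root = list(range(n))
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x
    for length, i, j in mst_edges(points):
        if length <= threshold:
            root[find(i)] = find(j)
    labels, seen = [], {}
    for i in range(n):
        labels.append(seen.setdefault(find(i), len(seen)))
    return labels

# two chain-like groups separated by a gap of 8
pts = np.array([[0.0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0]])
print(cut_mst(pts, threshold=2.0))  # → [0, 0, 0, 1, 1, 1]
```

Because the groups here are chains, every within-group MST edge has length 1 while the bridging edge has length 8, so any threshold between those two values yields the same two clusters; the sensitivity to the threshold only appears when within- and between-cluster edge lengths overlap.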


Comments (2)

笑,眼淚并存 2024-11-26 08:55:19


A lot of work has been done in this area. The usual advice is to start with K-means clustering unless you have a really good reason to do otherwise - but K-means does not do hierarchical clustering (normally, anyway), so you may well have such a reason. (It's entirely possible to do hierarchical K-means, though: do a first pass to create clusters, then another pass using the centroid of each of those clusters as a point, and continue until you have as few high-level clusters as desired.)
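The two-pass hierarchical K-means idea above can be sketched with a tiny numpy-only Lloyd's loop (the data, the k values, and the deterministic first-k initialization are all illustrative simplifications; a real implementation would use a library's K-means with proper initialization):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm, deterministically seeded with the
    first k points. Returns (labels, centroids)."""
    centroids = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)           # assign each point to nearest centroid
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# four tight pairs, ordered so the first four points seed one cluster each
pts = np.array([[0., 0], [5, 5], [20, 0], [25, 5],
                [0, 1], [5, 6], [20, 1], [25, 6]])

labels1, cents1 = kmeans(pts, k=4)   # first pass: fine clusters over raw points
labels2, _ = kmeans(cents1, k=2)     # second pass: cluster the centroids
top = labels2[labels1]               # lift high-level labels back to the points
print(top.tolist())                  # → [0, 0, 1, 1, 0, 0, 1, 1]
```

The second pass treats each first-pass centroid as a single point, exactly as the answer suggests; repeating the centroid pass gives successively coarser levels of the hierarchy.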

There are quite a few other clustering models though, and quite a few papers covering relative strengths and weaknesses, such as the following:

  1. Pairwise Clustering and Graphical Models
  2. Beyond pairwise clustering
  3. Parallel pairwise clustering
  4. A fast greedy pairwise distance clustering algorithm and its use in discovering thematic structures in large data sets
  5. Pairwise Clustering Algorithm
  6. Hierarchical Agglomerative Clustering

A little Googling will turn up lots more. Glancing back through my research directory from when I was working on clustering, I have dozens of papers, and my recollection is that there were a lot more that I looked at but didn't keep around, and many more still that I never got a chance to really even look at.

枫以 2024-11-26 08:55:19


There is a whole zoo of clustering algorithms. Among them, the minimum spanning tree (a.k.a. single-linkage clustering) has some nice theoretical properties, as noted e.g. at http://www.cs.uwaterloo.ca/~mackerma/Taxonomy.pdf. In particular, if you take a minimum spanning tree and remove all links longer than some threshold length, then the resulting grouping of points into clusters has the minimum total length of remaining links among all groupings with that number of clusters, for the same reason that Kruskal's algorithm produces a minimum spanning tree.

However, there is no guarantee that the minimum spanning tree will be best for your particular purpose, so I think you should either write down what you actually need from your clustering algorithm and then choose a method based on that, or try a variety of different clustering algorithms on your data and see which works best in practice.
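The Kruskal connection above can be illustrated directly: process pairwise edges in increasing order of length and union the components they connect, stopping once the desired number of clusters remains. This is exactly a single-linkage flat clustering. A numpy-only sketch (the data and cluster count are made up for the example):

```python
import numpy as np

def single_linkage(points, k):
    """Kruskal-style single linkage: repeatedly merge the two closest
    components until only k components remain, then return labels."""
    n = len(points)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    # all pairwise edges, sorted by length (as in Kruskal's algorithm)
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    components = n
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components
            parent[ri] = rj
            components -= 1
            if components == k:
                break
    roots = [find(i) for i in range(n)]
    seen = {}
    return [seen.setdefault(r, len(seen)) for r in roots]

pts = np.array([[0.0, 0], [0, 1], [8, 0], [8, 1], [0, 10]])
print(single_linkage(pts, k=3))  # → [0, 0, 1, 1, 2]
```

Stopping at k components and cutting the MST's k-1 longest links give the same partition, which is the property the Taxonomy paper discusses; the algorithm never needs the threshold to be chosen in advance.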
