K-means clustering with Mahout

Posted 2024-12-18 07:37:05

I'm using the clustering technique given here, which comes from the Mahout examples, to cluster a large dataset. However, when I visualize the resulting clustering, I get the following figure.

[Figure: Mahout k-means visualization]

I'm really struggling to understand what this actually means and have several questions.

  1. What do all the coloured lines indicate?
  2. What does it mean that there are so many clusters?
  3. Why are only a few areas crowded, while the other areas are not?
  4. Why do only a few of the coloured lines overlap each other?

Comments (1)

闻呓 2024-12-25 07:37:05

k-means is not the most advanced clustering technique. Circles are a misleading way to visualize it: k-means actually partitions the data space into Voronoi cells (look it up on Wikipedia). It also prefers similar-sized clusters.
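
To make the Voronoi point concrete, here is a minimal pure-Python sketch (not Mahout code). The centroids and points are made-up values for illustration only, and `nearest_centroid` is a hypothetical helper name. Each point simply gets the label of its closest centroid, so the cells cover the whole plane rather than a circle around each centroid.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Hypothetical 2-D centroids, chosen only for illustration.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Every point belongs to exactly one centroid's Voronoi cell,
# no matter how far away from that centroid it lies.
points = [(1.0, 1.0), (4.9, 0.0), (5.1, 0.0), (100.0, 3.0), (2.0, 7.0)]
labels = [nearest_centroid(p, centroids) for p in points]

print(labels)  # [0, 0, 1, 1, 2]
```

Note that `(100.0, 3.0)` is still assigned to the second centroid even though it is far outside any reasonable circle drawn around it, and that the boundary between the first two cells is the straight line x = 5.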

  1. I assume that the different colors indicate the different iterations of k-means. It needs several iterations to optimize its result (and it usually only reaches a local minimum, so different runs will give different results). So I guess the results aren't very stable yet. They shift only slowly, which is why they don't overlap much.

  2. The number of clusters is a parameter of k-means, commonly denoted k. k-means cannot determine the number of clusters itself, but if you run it with multiple values of k, you can test which result fits the data set best.

  3. k-means doesn't look at density. You need a density-based clustering algorithm for that. k-means prefers similar-sized clusters. Your "k" is probably too high.

  4. Since they are iteratively updated, the different iterations shouldn't overlap much.
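
As a rough illustration of points 1 and 2, here is a minimal pure-Python sketch of the standard k-means loop (Lloyd's algorithm). It is not Mahout's implementation; the function names, the synthetic three-blob data, and the seeds are all assumptions made up for the example. It records the sum of squared distances at each iteration, then compares the final value for several values of k.

```python
import math
import random

def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
    return new_centroids

def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def kmeans(points, k, seed=0, max_iter=100):
    """Run k-means once; return centroids, labels, and the inertia at each iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    history = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        history.append(inertia(points, labels, centroids))
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels, history

# Synthetic data: three Gaussian blobs (made up for this example).
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in blob_centers for _ in range(30)]

# Point 1: the objective is refined step by step within a single run.
_, _, history = kmeans(points, k=3, seed=1)

# Point 2: k is an input, so try several values and compare the results.
final_inertia = {k: kmeans(points, k, seed=1)[2][-1] for k in (1, 2, 3, 4, 5)}
for k, value in final_inertia.items():
    print(k, round(value, 1))
```

The values in `history` never increase from one iteration to the next, which is the iterative refinement described in point 1. Changing `seed` can change the final result, since each run only reaches a local minimum. Comparing `final_inertia` across values of k is one simple way to judge which k fits the data; it does not take density into account, in line with point 3.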
