Choosing an appropriate similarity metric & evaluating the validity of a k-means clustering model
I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
- How would I determine if my algorithm works correctly? I don't have a graph that would give some idea about the correctness of my algorithm.
- Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
4 Answers
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured by the cluster centers)?
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
- calculate the pair-wise distance between cluster centers; then
- calculate the median of those distances.
For intra-cluster variance:
- for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next
- (for each cluster) calculate the variance of the sequence of distances from the step above; then
- average these variance values.
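As a minimal sketch of that ratio in Python (assuming NumPy/SciPy are available, and that you already have the data matrix, the per-point cluster labels, and the k centroids from your k-means run; the function name and argument layout are my own, not part of the original answer):

```python
import numpy as np
from scipy.spatial.distance import pdist

def clustering_quality(X, labels, centers):
    """Ratio of inter-centroidal separation to intra-cluster variance.
    Higher values indicate tighter, better-separated clusters."""
    # inter-centroidal separation: median of pairwise distances between centers
    separation = np.median(pdist(centers))
    # intra-cluster variance: variance of point-to-centroid distances,
    # computed per cluster, then averaged across clusters
    variances = []
    for k, center in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - center, axis=1)
        variances.append(d.var())
    return separation / np.mean(variances)
```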
That's my answer to the first question. Here's the second question:
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
- subtract their feature vectors element-wise,
- square each item in that result vector,
- sum that result,
- take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
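A quick sketch of those four steps in NumPy; the same function works unchanged regardless of dimensionality:

```python
import numpy as np

def euclidean(a, b):
    diff = np.asarray(a) - np.asarray(b)  # subtract the feature vectors element-wise
    return np.sqrt(np.sum(diff ** 2))     # square each item, sum, take the square root

# identical code path for 30 or 100 dimensions
euclidean(np.zeros(30), np.ones(30))     # -> sqrt(30)
euclidean(np.zeros(100), np.ones(100))   # -> 10.0
```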
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about it, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
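As a rough, text-only stand-in for such a flow diagram, here is one possible decision helper; the branch structure and the specific metric suggestions are my own assumptions, not the original diagram:

```python
def suggest_metric(has_continuous, has_categorical, same_scale):
    """Illustrative decision helper only -- the branches and metric choices
    here are assumptions, not the flow diagram from the original answer."""
    if has_continuous and not has_categorical:
        # purely numeric data: Euclidean is fine when dimensions are comparable;
        # otherwise standardize (or weight) the features first
        return "euclidean" if same_scale else "standardize features, then euclidean"
    if has_categorical and not has_continuous:
        # purely categorical data: count mismatching attributes instead
        return "hamming (or another matching-based measure)"
    # mixed continuous + categorical data
    return "a mixed-type measure (e.g., Gower), or encode/split the features"
```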
Euclidean distance is good when dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, Euclidean should be replaced with a weighted distance.
Project it into 2D and show the picture -- this is a good option for seeing visually whether it works.
Or you may use some sanity check -- for example, find the cluster centers and verify that all items in a cluster aren't too far away from their center.
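A small sketch of that sanity check, assuming NumPy and that you have the data matrix, labels, and centers at hand (the threshold is an arbitrary illustrative choice):

```python
import numpy as np

def sanity_check(X, labels, centers, factor=3.0):
    """Flag clusters whose farthest member lies unusually far from its center."""
    for k, center in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - center, axis=1)
        if d.max() > factor * d.mean():
            print(f"cluster {k}: max distance {d.max():.2f} vs mean {d.mean():.2f}")
```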
Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code, and see if it makes much difference?
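If your distance lives in one helper function, the swap is a one-line change; a hedged sketch:

```python
import numpy as np

def dist_sq_euclidean(x, y):
    return np.sum((x - y) ** 2)     # current: sum (xi - yi)^2

def dist_manhattan(x, y):
    return np.sum(np.abs(x - y))    # suggested alternative: sum |xi - yi|
```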
A couple of possibilities: PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, and also SO questions/tagged/pca.
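A minimal sketch of the PCA-to-2D idea, assuming scikit-learn and matplotlib are available (the original answer names PCA only, not a particular library):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_clusters_2d(X, labels):
    """Project the 30-dimensional points onto 2 principal components,
    colored by k-means cluster label."""
    X2 = PCA(n_components=2).fit_transform(X)
    plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=20)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```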
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~20 dimensions, and with early cutoff it works even in 128 dimensions.
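A short sketch of the cKDTree idea -- the 3 nearest neighbors of each point under p=2 or p=1 (the random matrix is just a stand-in for your 300 x 30 data):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(300, 30)   # stand-in for your 300 objects x 30 dimensions
tree = cKDTree(X)

# k=4 because each point's nearest neighbor is itself; columns 1..3 are the
# 3 true nearest neighbors
d_euclid, i_euclid = tree.query(X, k=4, p=2)   # Euclidean
d_manhat, i_manhat = tree.query(X, k=4, p=1)   # Manhattan (L1)
```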
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.
You might want to try the Manhattan distance (or city block), which is robust to that (bear in mind that robustness always comes at a cost: in this case, a bit of information is lost).
There are many further distance metrics for specific problems (for example, the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
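For example, a brief sketch of swapping metrics by name with pdist (the random matrix is a stand-in for your data; all three metric names are ones pdist accepts):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(300, 30)   # stand-in for your 300 objects x 30 dimensions

d_euclidean  = squareform(pdist(X, metric="euclidean"))
d_manhattan  = squareform(pdist(X, metric="cityblock"))   # Manhattan / city block
d_braycurtis = squareform(pdist(X, metric="braycurtis"))  # e.g., for count data
```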