Choosing an appropriate similarity metric & evaluating the validity of a k-means clustering model
I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
- How would I determine if my algorithm works correctly? I don't have a graph that would give some idea about the correctness of my algorithm.
- Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
4 Answers
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured by the cluster centers)?
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
- calculate the pair-wise distance between cluster centers; then
- calculate the median of those distances.
For intra-cluster variance:
- for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next
- (for each cluster) calculate the variance of the sequence of distances from the step above; then
- average these variance values.
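As a minimal sketch of that ratio in Python (assuming NumPy/SciPy are available, and that you already have the data matrix, the per-point cluster labels, and the k centroids from your k-means run; the function name and argument layout are my own, not part of the original answer):

```python
import numpy as np
from scipy.spatial.distance import pdist

def clustering_quality(X, labels, centers):
    """Ratio of inter-centroidal separation to intra-cluster variance.
    Higher values indicate tighter, better-separated clusters."""
    # inter-centroidal separation: median of pairwise distances between centers
    separation = np.median(pdist(centers))
    # intra-cluster variance: variance of point-to-centroid distances,
    # computed per cluster, then averaged across clusters
    variances = []
    for k, center in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - center, axis=1)
        variances.append(d.var())
    return separation / np.mean(variances)
```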
That's my answer to the first question. Here's the second question:
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
- subtract their feature vectors element-wise,
- square each item in that result vector,
- sum that result,
- take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
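A quick sketch of those four steps in NumPy; the same function works unchanged regardless of dimensionality:

```python
import numpy as np

def euclidean(a, b):
    diff = np.asarray(a) - np.asarray(b)  # subtract the feature vectors element-wise
    return np.sqrt(np.sum(diff ** 2))     # square each item, sum, take the square root

# identical code path for 30 or 100 dimensions
euclidean(np.zeros(30), np.ones(30))     # -> sqrt(30)
euclidean(np.zeros(100), np.ones(100))   # -> 10.0
```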
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about it, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
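As a rough, text-only stand-in for such a flow diagram, here is one possible decision helper; the branch structure and the specific metric suggestions are my own assumptions, not the original diagram:

```python
def suggest_metric(has_continuous, has_categorical, same_scale):
    """Illustrative decision helper only -- the branches and metric choices
    here are assumptions, not the flow diagram from the original answer."""
    if has_continuous and not has_categorical:
        # purely numeric data: Euclidean is fine when dimensions are comparable;
        # otherwise standardize (or weight) the features first
        return "euclidean" if same_scale else "standardize features, then euclidean"
    if has_categorical and not has_continuous:
        # purely categorical data: count mismatching attributes instead
        return "hamming (or another matching-based measure)"
    # mixed continuous + categorical data
    return "a mixed-type measure (e.g., Gower), or encode/split the features"
```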
Euclidean distance is good when dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, Euclidean should be replaced with a weighted distance.
Project it into 2D and show the picture -- this is a good option for seeing visually whether it works.
Or you may use some sanity check -- for example, find the cluster centers and verify that all items in a cluster aren't too far away from their center.
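A small sketch of that sanity check, assuming NumPy and that you have the data matrix, labels, and centers at hand (the threshold is an arbitrary illustrative choice):

```python
import numpy as np

def sanity_check(X, labels, centers, factor=3.0):
    """Flag clusters whose farthest member lies unusually far from its center."""
    for k, center in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - center, axis=1)
        if d.max() > factor * d.mean():
            print(f"cluster {k}: max distance {d.max():.2f} vs mean {d.mean():.2f}")
```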
Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code, and see if it makes much difference?
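If your distance lives in one helper function, the swap is a one-line change; a hedged sketch:

```python
import numpy as np

def dist_sq_euclidean(x, y):
    return np.sum((x - y) ** 2)     # current: sum (xi - yi)^2

def dist_manhattan(x, y):
    return np.sum(np.abs(x - y))    # suggested alternative: sum |xi - yi|
```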
A couple of possibilities: PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, and also SO questions/tagged/pca.
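A minimal sketch of the PCA-to-2D idea, assuming scikit-learn and matplotlib are available (the original answer names PCA only, not a particular library):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_clusters_2d(X, labels):
    """Project the 30-dimensional points onto 2 principal components,
    colored by k-means cluster label."""
    X2 = PCA(n_components=2).fit_transform(X)
    plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=20)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```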
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~20 dimensions, and with early cutoff it works even in 128 dimensions.
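A short sketch of the cKDTree idea -- the 3 nearest neighbors of each point under p=2 or p=1 (the random matrix is just a stand-in for your 300 x 30 data):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(300, 30)   # stand-in for your 300 objects x 30 dimensions
tree = cKDTree(X)

# k=4 because each point's nearest neighbor is itself; columns 1..3 are the
# 3 true nearest neighbors
d_euclid, i_euclid = tree.query(X, k=4, p=2)   # Euclidean
d_manhat, i_manhat = tree.query(X, k=4, p=1)   # Manhattan (L1)
```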
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.
You might want to try the Manhattan distance (or city block), which is robust to that (bear in mind that robustness always comes at a cost: in this case, a bit of information is lost).
There are many further distance metrics for specific problems (for example, the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
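For example, a brief sketch of swapping metrics by name with pdist (the random matrix is a stand-in for your data; all three metric names are ones pdist accepts):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(300, 30)   # stand-in for your 300 objects x 30 dimensions

d_euclidean  = squareform(pdist(X, metric="euclidean"))
d_manhattan  = squareform(pdist(X, metric="cityblock"))   # Manhattan / city block
d_braycurtis = squareform(pdist(X, metric="braycurtis"))  # e.g., for count data
```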