How to calculate the percentage-of-variance measure for k-means?
On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.
More precisely, if you graph the percentage of variance explained by
the clusters against the number of clusters, the first clusters will
add much information (explain a lot of variance), but at some point
the marginal gain will drop, giving an angle in the graph.
Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?
import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
                      [ 0,  1],
                      [ 0, -1],
                      [ 1,  0],
                      [-1,  0],
                      [ 9,  9],
                      [ 9, 10],
                      [ 9,  8],
                      [10,  9],
                      [10,  8]])

kmeans(points, 2)
(array([[9, 8],
        [0, 0]]), 0.9414213562373096)
I am specifically looking at computing the 0.94... measure given just the points and the centroids. I am not sure if any of the built-in methods of scipy can be used, or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?
In short, my questions (all related) are the following:
- Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
- How would the methodology change if a different distance function such as cosine similarity is used?
EDIT 2: Distortion
from scipy.spatial.distance import cdist

# centroids is the codebook returned by kmeans(points, 2) above
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
The output for the first set of points is accurate. However, when I try a different set:
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
[1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
I guess the last value does not match because kmeans
seems to be dividing the value by the total number of points in the dataset.
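A quick check of that guess (a sketch; D is the cdist matrix computed on pp above):

>>> numpy.min(D, axis=1).sum() / len(pp)    # divide the sum by the number of points
1.1330618877807475
>>> numpy.min(D, axis=1).mean()             # same thing: mean distance to the nearest centroid, matching kmeans()
1.1330618877807475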
EDIT 1: Percent Variance
My code so far (should be added to Denis's K-means implementation):
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
    delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    # boolean mask of the points assigned to this cluster
    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)

    indices = [i for i, x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    # variance of the nearest-centre distances within this cluster
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
And the following is the output for k=2:
Unique clusters: set([0, 1])
Cluster: 0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041
Cluster: 1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16
Sum of variances: 0.587451660041
Total Variance: 21.1475
Percent: 2.77787757437
On my real dataset (does not look right to me!):
Sum of variances: 0.0188124746402
Total Variance: 0.00313754329764
Percent: 599.592510943
Unique clusters: set([0, 1, 2, 3])
Sum of variances: 0.0255808508714
Total Variance: 0.00313754329764
Percent: 815.314672809
Unique clusters: set([0, 1, 2, 3, 4])
Sum of variances: 0.0588210052519
Total Variance: 0.00313754329764
Percent: 1874.74720416
Unique clusters: set([0, 1, 2, 3, 4, 5])
Sum of variances: 0.0672406353655
Total Variance: 0.00313754329764
Percent: 2143.09824556
Unique clusters: set([0, 1, 2, 3, 4, 5, 6])
Sum of variances: 0.0646291452839
Total Variance: 0.00313754329764
Percent: 2059.86465055
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7])
Sum of variances: 0.0817517362176
Total Variance: 0.00313754329764
Percent: 2605.5970695
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8])
Sum of variances: 0.0912820650486
Total Variance: 0.00313754329764
Percent: 2909.34837831
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Sum of variances: 0.102119601368
Total Variance: 0.00313754329764
Percent: 3254.76309585
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Sum of variances: 0.125549475536
Total Variance: 0.00313754329764
Percent: 4001.52168834
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Sum of variances: 0.138469402779
Total Variance: 0.00313754329764
Percent: 4413.30651542
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Answers (2)
The distortion, as far as Kmeans is concerned, is used as a stopping criterion (if the change between two iterations is less than some threshold, we assume convergence).
If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/Numpy/Scipy):
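The MATLAB code itself is not reproduced here; below is a minimal NumPy/SciPy sketch of the same computation (Euclidean distance and the points array from the question are assumed):

import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist

points = np.array([[0, 0], [0, 1], [0, -1], [1, 0], [-1, 0],
                   [9, 9], [9, 10], [9, 8], [10, 9], [10, 8]], dtype=float)

centroids, _ = kmeans(points, 2)

# distance from every point to every centroid, then keep only the nearest one
d = cdist(points, centroids, 'euclidean').min(axis=1)

print(d.mean())         # scipy's "distortion": mean distance to the nearest centroid
print((d ** 2).sum())   # within-cluster sum of squares, another common elbow measure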
EDIT#1:

I had some time to play around with this. Here is an example of K-means clustering applied to the Fisher Iris dataset (4 features, 150 instances). We iterate over k = 1..10, plot the elbow curve, pick K = 3 as the number of clusters, and show a scatter plot of the result.

Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as the distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the metric of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), and then compute the distortion from that.
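The original code and plots are not reproduced here; a rough sketch of the elbow loop described above, using scipy.cluster.vq.kmeans (loading the Iris data through scikit-learn and plotting with matplotlib are assumptions):

import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans
from sklearn.datasets import load_iris   # assumed source of the Fisher Iris data

X = load_iris().data.astype(float)       # 150 instances, 4 features

ks = range(1, 11)
distortions = []
for k in ks:
    # kmeans returns (codebook, distortion); distortion is the mean
    # distance from each observation to its nearest centroid
    _, distortion = kmeans(X, k)
    distortions.append(distortion)

plt.plot(list(ks), distortions, 'o-')
plt.xlabel('number of clusters k')
plt.ylabel('distortion (mean within-cluster distance)')
plt.title('Elbow curve on the Iris data')
plt.show()

Picking K = 3 at the bend and re-running kmeans(X, 3) then gives the centroids for the scatter plot the answer describes.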
EDIT#2:

In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Component Analysis is applied to reduce the dimensionality from 64 down to 2:
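The original code and figure are likewise not reproduced; a rough sketch of that experiment (the scikit-learn loader and the choice of 10 clusters are assumptions):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

digits = load_digits()                                 # 1797 images, 8x8 pixels = 64 features
X2 = PCA(n_components=2).fit_transform(digits.data)   # reduce 64 dimensions down to 2

km = KMeans(n_clusters=10).fit(X2)                     # 10 clusters, one per digit (assumption)

plt.scatter(X2[:, 0], X2[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x', s=80)
plt.title('K-means on the PCA-reduced digits data')
plt.show()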
You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.
Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.
For metric="sqeuclidean" and 1 cluster, the average length-squared is the total variance X.var(); for 2 clusters, it's less ... down to N clusters, lengths all 0. "Percent of variance explained" is 100 % - this average.
Code for this, under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means:
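The linked code is not copied here; a minimal NumPy sketch of the measure just described (the cdist-based nearest-centre assignment is an assumption):

import numpy as np
from scipy.spatial.distance import cdist

def percent_variance_explained(X, centres):
    # squared length of each "ray" from a point to its nearest centre
    raylen2 = cdist(X, centres, 'sqeuclidean').min(axis=1)
    # baseline: 1 cluster, i.e. squared distances to the overall mean (the total variance)
    total = cdist(X, X.mean(axis=0)[None, :], 'sqeuclidean').ravel()
    return 100.0 * (1.0 - raylen2.mean() / total.mean())

For k = 1 this gives 0 %, and it rises towards 100 % as k approaches N; this is the quantity to plot against k for the elbow.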
Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting, visualization, is not easy.
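For example (a sketch; dist is assumed to hold each point's distance to its nearest centre, e.g. the third value returned by kmeanssample in the question):

print(np.mean(dist), np.median(dist))
counts, bin_edges = np.histogram(dist, bins=20)   # distribution of ray lengths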
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?