How to implement the K-Means++ algorithm?
I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, i.e. the initialization, since the rest works like the original K-Means algorithm.

- Is the probability function used based on distance or Gaussian?
- At the same time, the point most distant from the already chosen centroids is picked as a new centroid.

I would appreciate a step-by-step explanation and an example. The one on Wikipedia is not clear enough. Very well-commented source code would also help. If you are using 6 arrays, please tell us which one is for what.
3 Answers
Interesting question. Thank you for bringing this paper to my attention: K-Means++: The Advantages of Careful Seeding.

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2 = 4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
I've coded the initialization procedure in Python; I don't know if this helps you.
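(A minimal sketch of that seeding, assuming numpy and observations stored as the rows of a 2-D array X; the function name initialize and the seed parameter are illustrative.)

```python
import numpy as np

def initialize(X, K, seed=None):
    """k-means++ seeding: pick K initial centers from the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]            # first center: uniform at random
    for _ in range(1, K):
        C = np.array(centers)
        # squared distance from every point to its nearest chosen center
        D2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = D2 / D2.sum()                 # P(x) proportional to D(x)^2
        cumprobs = probs.cumsum()             # boundaries partitioning [0, 1]
        r = rng.random()                      # uniform draw from [0, 1)
        for i, p in enumerate(cumprobs):      # find the interval r falls in
            if r < p:
                break
        centers.append(X[i])
    return np.array(centers)
```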
EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0, 1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is chosen uniformly in [0, 1], it falls into exactly one of these intervals (because of break). The for loop checks which partition r is in. Example:
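(A small illustration, assuming the 1-D data above with c1 = 0; the cumulative boundaries follow from the worked probabilities.)

```python
import numpy as np

D2 = np.array([0., 1., 4., 9., 16.])   # squared distances of [0, 1, 2, 3, 4] to c1 = 0
probs = D2 / D2.sum()                  # [0, 1/30, 4/30, 9/30, 16/30]
print(probs.cumsum())                  # [0.     0.0333 0.1667 0.4667 1.    ]
# A draw of r = 0.8 falls in the last interval (0.4667, 1.0],
# so x = 4 is chosen: the point with the largest probability, 16/30.
```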
One liner.

Say we need to select 2 cluster centers. Instead of selecting them both at random (as we do in plain k-means), we select the first one at random, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, since they are far from it) and assign the second cluster center near those far points.
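Strictly speaking, k-means++ does not always take the single farthest point: the next center is drawn at random with probability proportional to the squared distance, so far points are only the most likely picks. A minimal sketch of that weighted draw, assuming numpy and the 1-D data from the answer above:

```python
import numpy as np

X = np.array([0., 1., 2., 3., 4.])
c1 = 0.0                                # first center, chosen at random
D2 = (X - c1) ** 2                      # squared distance to the chosen center
weights = D2 / D2.sum()                 # far points get high probability
c2 = np.random.choice(X, p=weights)     # most likely 4, but not guaranteed
```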
I have prepared a full source implementation of k-means++ based on the book "Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.

Indeed, there are two distance functions here. For the initial centroids a standard one based on numpy.inner is used, and then for fixing the centroids the Pearson one is used. Maybe the Pearson one could also be used for the initial centroids; they say it is better.
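A sketch of what those two distance functions might look like, assuming numpy (the names euclidean_sq and pearson_distance are illustrative; the Pearson distance is the usual 1 - r form used in Collective Intelligence):

```python
import numpy as np

def euclidean_sq(a, b):
    """Squared Euclidean distance via the inner product, used for seeding."""
    d = a - b
    return np.inner(d, d)

def pearson_distance(a, b):
    """1 - Pearson correlation: near 0 when the two vectors co-vary."""
    return 1.0 - np.corrcoef(a, b)[0, 1]
```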
data.txt: