Implementation Details of K-Means++ Without Sklearn

Posted on 2025-01-24 17:41:39


I am doing K-means on the MNIST dataset. However, I ran into difficulties implementing the initialization and some of the further steps.

For the initialization, I have to first pick one random data point as the first centroid. Then for each remaining centroid, we also pick a data point at random, but from a weighted probability distribution, until all the centroids are chosen:

[Image: the k-means++ weighting, P(x) = D_{k-1}(x)^2 / Σ_{x'} D_{k-1}(x')^2, where D_{k-1}(x) is the distance from x to the nearest of the k-1 centroids chosen so far]

I am stuck at this step: how can I apply this distribution to make the selection? I mean, how do I implement it? For D_{k-1}(x), can I just use np.linalg.norm to compute it and square it?
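
For concreteness, a minimal sketch of that distance computation, assuming the centroids picked so far are stacked in an array chosen of shape (k-1, n_features) (the name chosen is illustrative, not from the original post):

import numpy as np

# Pairwise differences between every sample and every chosen centroid:
# broadcasting gives shape (n_samples, k-1, n_features).
diffs = input_x[:, None, :] - chosen[None, :, :]
dists = np.linalg.norm(diffs, axis=2)  # shape (n_samples, k-1)

# D_{k-1}(x): distance from each sample to its closest chosen centroid.
d_sq = np.min(dists, axis=1) ** 2      # squared, as the formula requires
probs = d_sq / d_sq.sum()              # the weighted distribution P(x)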

For my implementation, so far I have only initialized the first centroid:

self.centroids = np.zeros((self.num_clusters, input_x.shape[1]))
ran_num = np.random.choice(input_x.shape[0])  # index of one random sample
self.centroids[0] = input_x[ran_num]          # first centroid = that sample

for k in range(1, self.num_clusters):
    pass  # TODO: pick the k-th centroid from the weighted distribution

For the next step, do I need to find the next centroid by taking the sample point with the largest distance from the previous centroid?


Comments (1)

皓月长歌 2025-01-31 17:41:39


You need to create a distribution where the probability of selecting an observation is the (normalized) squared distance between the observation and its closest existing cluster center. Thus, when selecting a new cluster center, there is a high probability of picking observations that are far from all already existing cluster centers, and a low probability of picking observations that are close to an already existing cluster center.

That would look like this:

centers = []
centers.append(X[np.random.randint(X.shape[0])])  # initial center = one random sample
distance = np.full(X.shape[0], np.inf)            # each point's distance to its nearest chosen center
for j in range(1, self.n_clusters):
    # Only the newest center can shrink a point's nearest-center distance.
    distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
    p = np.square(distance) / np.sum(np.square(distance))  # probability vector [p1, ..., pn]
    sample = np.random.choice(X.shape[0], p=p)
    centers.append(X[sample])
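
As a quick check, the same logic can be wrapped in a standalone function and run on toy data; the function name kmeanspp_init and the two-blob data below are illustrative, not part of the original answer:

import numpy as np

def kmeanspp_init(X, n_clusters):
    # K-means++ initialization: identical logic to the snippet above.
    centers = [X[np.random.randint(X.shape[0])]]
    distance = np.full(X.shape[0], np.inf)
    for _ in range(1, n_clusters):
        distance = np.minimum(np.linalg.norm(X - centers[-1], axis=1), distance)
        p = np.square(distance) / np.sum(np.square(distance))
        centers.append(X[np.random.choice(X.shape[0], p=p)])
    return np.array(centers)

# Two well-separated blobs: the two chosen centers should almost
# always land in different blobs.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 10])
print(kmeanspp_init(X, n_clusters=2))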