Predicting values with the k-Means clustering algorithm

Posted 2024-12-16 19:21:45

I'm messing around with machine learning, and I've written a K-means implementation in Python. It takes two-dimensional data and organises the points into clusters. Each data point also has a class value of either 0 or 1.

What confuses me about the algorithm is how I can then use it to predict values for another set of two-dimensional data whose class value is unknown rather than 0 or 1. For each cluster, should I average the class values of the points within it to get either a 0 or a 1, and then, if an unknown point is closest to that cluster, give that point the averaged value? Or is there a smarter method?

Cheers!

Comments (4)

╰◇生如夏花灿烂 2024-12-23 19:21:45

To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.

In other words, these are the same steps you used for the iterative assignment of each point in your original data set to one of k clusters. The only difference here is that the centroids you use for this computation are the final set, i.e., the values of the centroids at the last iteration.

Here's one implementation in Python (with NumPy):

>>> import numpy as NP
>>> # made-up values based on your spec (2D data + 2 clusters)
>>> centroids = NP.array([[54, 85], [99, 78]])
>>> centroids
      array([[54, 85],
             [99, 78]])

>>> # a made-up new data point within the problem domain:
>>> new_data = NP.array([67, 78])

>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting: (2, 2) - (2,)
>>> diff
      array([[-13,   7],
             [ 32,   0]])

>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1))  # Euclidean distance
>>> dist
      array([ 14.76,  32.  ])

>>> closest_centroid = centroids[NP.argmin(dist),]
>>> closest_centroid
       array([54, 85])
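
The same nearest-centroid rule extends to a whole batch of new points. The following is only a minimal sketch, assuming the final centroids are stored as a (k, 2) array; the function name `assign_clusters` and the sample points are made up for illustration:

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return, for each row of `points`, the index of its nearest centroid."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    centroids = np.asarray(centroids, dtype=float)
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d): every point vs. every centroid
    diff = points[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (n, k) Euclidean distances
    return dist.argmin(axis=1)

# made-up values: the two centroids from above and three new points
centroids = np.array([[54, 85], [99, 78]])
new_points = np.array([[67, 78], [95, 80], [50, 90]])
labels = assign_clusters(new_points, centroids)
print(labels)  # cluster index (0 or 1) for each new point
```

The result is a cluster index per point, not a class value; turning it into a 0/1 prediction still requires some rule that maps each cluster to a class.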

岁月苍老的讽刺 2024-12-23 19:21:45

I know I might be late, but here is my general solution to your problem:

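import numpy as np

def predict(data, centroids):
    """Print and return the index of the nearest centroid for each row of data."""
    centroids, data = np.array(centroids), np.array(data)
    distances = []
    for unit in data:
        for center in centroids:
            # squared Euclidean distance; skipping the sqrt does not change the argmin
            distances.append(np.sum((unit - center) ** 2))
    # one row per data point, one column per centroid
    distances = np.reshape(distances, (len(data), len(centroids)))
    closest_centroid = [np.argmin(dist) for dist in distances]
    print(closest_centroid)
    return closest_centroid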

披肩女神 2024-12-23 19:21:45

If you are considering assigning a value based on the average class value within the nearest cluster, you are talking about some form of "soft decoder", which estimates not only the class of the new point but also your level of confidence in that estimate. The alternative would be a "hard decoder", where only the values 0 and 1 are legal (because only they occur in the training data set), and the new point gets the median of the class values within the nearest cluster; for 0/1 labels that is the majority class. My guess is that you should always assign a known-valid class value (0 or 1) to each point, so a raw average of class values is not a valid prediction.
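
To make the two options concrete, here is a rough sketch that computes both a soft value (the mean class value per cluster) and a hard value (the majority class per cluster). The function name `cluster_class_values`, the tie-breaking rule, and the sample data are all assumptions made for illustration, not part of the original question:

```python
import numpy as np

def cluster_class_values(assignments, classes, k):
    """Per cluster: mean of the 0/1 class values (soft) and majority class (hard)."""
    assignments = np.asarray(assignments)
    classes = np.asarray(classes, dtype=float)
    soft = np.zeros(k)
    hard = np.zeros(k, dtype=int)
    for c in range(k):
        members = classes[assignments == c]
        if members.size == 0:
            continue  # empty cluster: left at 0 in this sketch
        soft[c] = members.mean()       # fraction of members with class 1
        hard[c] = int(soft[c] >= 0.5)  # assumption: ties go to class 1
    return soft, hard

# made-up training results: cluster index and known class for six points
train_assignments = np.array([0, 0, 0, 1, 1, 1])
train_classes = np.array([0, 0, 1, 1, 1, 1])
soft, hard = cluster_class_values(train_assignments, train_classes, k=2)

# a new point that the nearest-centroid rule placed in cluster 0
new_cluster = 0
print(soft[new_cluster], hard[new_cluster])
```

The soft value can be read as a rough confidence that the point belongs to class 1, while the hard value is always a legal class label.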

日久见人心 2024-12-23 19:21:45

This is how I assigned each point the label of its closest existing centroid. It can
also be useful for implementing online/incremental clustering: new points are assigned
to the existing clusters while the centroids stay fixed. Be careful, because after
(let's say) 5-10% new points, you might want to recalculate the centroid coordinates.

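import numpy as np
import pandas as pd

def Labs(dataset, centroids):
    # dataset and centroids are 2-D NumPy arrays (one row per point / per centroid)
    a = []
    for i in range(len(dataset)):
        d = []
        for j in range(len(centroids)):
            dist = np.linalg.norm(dataset[i, :] - centroids[j, :])
            d.append(dist)
        assignment = np.argmin(d)
        a.append(assignment)
    # labels are 1-based: label 1 corresponds to centroids[0]
    return pd.DataFrame(np.array(a) + 1, columns=['Lab'])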

I hope it helps
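
The incremental idea described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `nearest` and `add_points`, the 10% threshold, and the sample data are made up, and the recalculation is a single mean update per cluster rather than a full re-run of k-means.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the nearest centroid for each row of `points`."""
    diff = points[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=-1).argmin(axis=1)

def add_points(data, labels, centroids, new_points, n_fitted, threshold=0.10):
    """Assign new_points with the centroids held fixed; recompute the centroids
    once the points added since the last fit exceed threshold * n_fitted."""
    new_labels = nearest(new_points, centroids)
    data = np.vstack([data, new_points])
    labels = np.concatenate([labels, new_labels])
    if len(data) - n_fitted > threshold * n_fitted:
        # one mean update per cluster; an empty cluster keeps its old centroid
        centroids = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(len(centroids))
        ])
        n_fitted = len(data)
    return data, labels, centroids, n_fitted

# made-up starting state: four clustered points and their two centroids
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 11.0]])
data, labels, centroids, n_fitted = add_points(
    data, labels, centroids, np.array([[2.0, 1.0]]), n_fitted=4)
print(labels, centroids, n_fitted)
```

Here the labels are 0-based cluster indices; add 1 if you want them to match the `Lab` column above.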
