Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

11 Answers

萧瑟寒风 2024-11-05 11:38:27

Just use nltk instead, which lets you do this, e.g.

import nltk
from nltk.cluster.kmeans import KMeansClusterer

NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
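
For reference, here is a self-contained toy version of the same idea; the data vectors below are purely illustrative:

import numpy as np
import nltk
from nltk.cluster.kmeans import KMeansClusterer

data = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy vectors
kclusterer = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
print(assigned_clusters)  # e.g. [0, 0, 1, 1]
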
如何视而不见 2024-11-05 11:38:27

Yes, you can use a different metric function; however, by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster.

You could use a different metric, so even though you are still calculating the mean, you could use something like the Mahalanobis distance.
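
As a minimal sketch of that idea (not sklearn's API; the function name and toy setup are made up here): a Lloyd-style loop whose assignment step uses the Mahalanobis distance via scipy's cdist, while the centroids are still plain means.

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_mahalanobis(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    VI = np.linalg.inv(np.cov(X.T))  # inverse covariance matrix for Mahalanobis
    for _ in range(n_iter):
        # assignment step: nearest centre under the Mahalanobis distance
        labels = cdist(X, centers, metric="mahalanobis", VI=VI).argmin(axis=1)
        # update step: centres are still ordinary means
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers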

两仪 2024-11-05 11:38:27

There is pyclustering, which is Python/C++ (so it's fast!) and lets you specify a custom metric function:

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# sample data to cluster (any list of points / 2-D array)
sample = [[4.5, 5.8], [4.8, 6.0], [5.6, 6.4], [5.9, 6.6]]

# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

Actually, I haven't tested this code but cobbled it together from a ticket and example code.

南城旧梦 2024-11-05 11:38:27

The k-means implementation in Spectral Python allows the use of the L1 (Manhattan) distance.

独留℉清风醉 2024-11-05 11:38:27

sklearn's KMeans uses the Euclidean distance and has no metric parameter. That said, if you're clustering time series, you can use the tslearn Python package, where you can specify a metric (dtw, softdtw, euclidean).
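
For example, a minimal sketch with tslearn's TimeSeriesKMeans and the DTW metric (the random series below are purely illustrative):

import numpy as np
from tslearn.clustering import TimeSeriesKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50, 1))  # 20 univariate series of length 50
model = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = model.fit_predict(X)
print(labels)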

捶死心动 2024-11-05 11:38:27

The Affinity Propagation algorithm from the sklearn library allows you to pass a similarity matrix instead of the samples. So, you can use your own metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to the function by setting the "affinity" parameter to "precomputed": https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit
In terms of K-Means, I think it is also possible, but I have not tried it.
However, as the other answers stated, finding the mean under a different metric will be the issue. Instead, you can use the PAM (K-Medoids) algorithm, since it evaluates the change in total deviation (TD) directly from the dissimilarities and does not need to compute means. https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
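
A rough sketch of both suggestions; the cosine similarity, Manhattan dissimilarity and toy data are only stand-ins for "your own metric", and the kmedoids call follows the fasterpam(diss, k) interface described in the python-kmedoids docs:

import numpy as np
import kmedoids
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances

X = np.random.default_rng(0).normal(size=(30, 5))

# Affinity Propagation on a precomputed *similarity* matrix
S = cosine_similarity(X)
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
print(ap.labels_)

# K-Medoids (FasterPAM) on a precomputed *dissimilarity* matrix
D = manhattan_distances(X)
km = kmedoids.fasterpam(D, 3)
print(km.labels)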

征棹 2024-11-05 11:38:27

Yes, in the current stable version of sklearn (scikit-learn 1.1.3), you can easily use your own distance metric. All you have to do is create a class that inherits from sklearn.cluster.KMeans and overrides its _transform method.

The example below is for the IoU distance from the YOLOv2 paper.

import sklearn.cluster
import numpy as np

def anchor_iou(box_dims, centroid_box_dims):
    box_w, box_h = box_dims[..., 0], box_dims[..., 1]
    centroid_w, centroid_h = centroid_box_dims[..., 0], centroid_box_dims[..., 1]
    inter_w = np.minimum(box_w[..., np.newaxis], centroid_w[np.newaxis, ...])
    inter_h = np.minimum(box_h[..., np.newaxis], centroid_h[np.newaxis, ...])
    inter_area = inter_w * inter_h
    centroid_area = centroid_w * centroid_h
    box_area = box_w * box_h
    return inter_area / (
        centroid_area[np.newaxis, ...] + box_area[..., np.newaxis] - inter_area
    )

class IOUKMeans(sklearn.cluster.KMeans):
    def __init__(
        self,
        n_clusters=8,
        *,
        init="k-means++",
        n_init=10,
        max_iter=300,
        tol=1e-4,
        verbose=0,
        random_state=None,
        copy_x=True,
        algorithm="lloyd",
    ):
        super().__init__(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            verbose=verbose,
            random_state=random_state,
            copy_x=copy_x,
            algorithm=algorithm
        )

    def _transform(self, X):
        return anchor_iou(X, self.cluster_centers_)

rng = np.random.default_rng(12345)
num_boxes = 10
num_clusters = 2  # number of anchor clusters to find
bboxes = rng.integers(low=0, high=100, size=(num_boxes, 2))

kmeans = IOUKMeans(num_clusters).fit(bboxes)
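
Continuing that snippet, the fitted estimator can be inspected like any other sklearn KMeans; since _transform is overridden, transform() now returns IoU values rather than Euclidean distances (a hypothetical continuation):

print(kmeans.cluster_centers_)   # learned anchor (width, height) pairs
print(kmeans.labels_)            # cluster index assigned to each box
print(kmeans.transform(bboxes))  # IoU of every box with every learned centre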

躲猫猫 2024-11-05 11:38:27

As of version scikit-learn==1.2.2, one could replace _euclidean_distances in sklearn.cluster._kmeans with the following:

import sklearn.cluster._kmeans as kmeans
from sklearn.metrics import pairwise_distances

def custom_distances(X, Y=None, Y_norm_squared=None, squared=False):
    if squared: #squared equals False during cluster center estimation
        return pairwise_distances(X,Y, metric='minkowski', p=1.5)
    else:
        return pairwise_distances(X,Y, metric='minkowski', p=1.5)
    
kmeans._euclidean_distances = custom_distances
kmeans.euclidean_distances = custom_distances # utilized by the method `KMeans._transform`

Then create the base estimator as usual:

km = kmeans.KMeans(init="k-means++", n_clusters=clusters, n_init=4, random_state=0)
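
For completeness, a hypothetical end-to-end use of the patched estimator on toy data (the data and the value of clusters are made up for this sketch):

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 4))
clusters = 3  # made-up number of clusters for this sketch
km = kmeans.KMeans(init="k-means++", n_clusters=clusters, n_init=4, random_state=0)
labels = km.fit_predict(X)
print(labels[:10])
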
作妖 2024-11-05 11:38:27

def distance_metrics(dist_metrics):
    # Assumes a pyclustering-style kmeans(data, initial_centers, metric=...) and that
    # trs_data (210 samples) and initial_centers are defined elsewhere.
    kmeans_instance = kmeans(trs_data, initial_centers, metric=dist_metrics)
    kmeans_instance.process()
    clusters = kmeans_instance.get_clusters()

    label = np.zeros(210, dtype=int)
    for i in range(len(clusters)):
        for index, j in enumerate(clusters[i]):
            label[j] = i
    return label
相守太难 2024-11-05 11:38:26

Here's a small kmeans that uses any of the 20-odd distances in
scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough);
in particular, what are your N, dim, k, metric ?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( range( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26 Mar 2012:

1) for cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

is fast. For bit vectors, keep the norms separately from the vectors
instead of expanding out to floats
(although some programs may expand for you).
For sparse vectors, say 1 % of N, X . Y should take time O( 2 % N ),
space O(N); but I don't know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means ...
with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means.
If you're expecting roughly equal-sized clusters, but they come out
[44 37 9 5 5] % ... (sound of head-scratching).

天邊彩虹 2024-11-05 11:38:26

Unfortunately no: scikit-learn's current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances, and Denis' answer above is not the correct way to implement k-means for other metrics.
