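Python clustering package that supports distance-matrix-based clustering but can also predict new rows (without a new clustering / distance matrix)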

Posted on 2025-02-11 16:15:43

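I am aware of various (sklearn) clustering algorithms that work with distance matrices, e.g. one derived from the proximity matrix of a random forest (somewhat clumsy but reproducible code below). Is there any clustering algorithm (working with a distance matrix) where the fitted cluster model (e.g. cluster_model below) can produce the cluster membership of a new data row?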

from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets

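def distanceMatrix(model, X, normalize=True):
    """Return 1 - proximity, where proximity counts the trees in which two rows share a leaf.

    The result is only a distance in [0, 1] when normalize=True.
    """
    # leaf index of every row in every tree, shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    # proximity from the first tree: 1 where two rows fall into the same leaf
    a = terminals[:,0]
    proxMat = 1 * np.equal.outer(a, a)

    # add the same-leaf counts of the remaining trees
    for i in range(1, nTrees):
        a = terminals[:,i]
        proxMat += 1*np.equal.outer(a, a)

    # turn the counts into the fraction of trees in which two rows share a leaf
    if normalize:
        proxMat = proxMat / nTrees

    return 1 - proxMat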

# use iris data to make example reproducible and fast
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['target'] = pd.Series(iris['target'], name = 'target_values')
df['target_name'] = df['target'].replace([0,1,2], ['iris-' + species for species in iris['target_names'].tolist()])

# simple one hot
df['iris_setosa'] = (df['target_name'] == 'iris-setosa').astype(int)
df['iris_versicolor'] = (df['target_name'] == 'iris-versicolor').astype(int)
df['iris_virginica'] = (df['target_name'] == 'iris-virginica').astype(int)

# the new regression model "target"
y = df['petal width (cm)']

X = df.drop([
    'target'
    ,'target_name'
    ,'petal width (cm)'
], axis = 1)

# fit random forest just for the purpose of getting proximity matrix
# open question: does it matter which target is picked and/or whether regression or classification is used?
# this is just to produce a toy dataset with mixed data
overfitted_model = RandomForestRegressor(n_estimators=250, min_samples_leaf=10)
overfitted_model.fit(X, y)

distance_matrix = distanceMatrix(overfitted_model, X, normalize=True)

# scikit-learn >= 1.2 calls this parameter `metric`; the older name `affinity` was removed in 1.4
cluster_model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
cluster_model.fit(distance_matrix)

df['label'] = cluster_model.labels_

PS:

Readers may find this interesting in this context.

煮酒 2025-02-18 16:15:44

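For agglomerative clustering, adding data points requires recomputing the clusters because of how this type of clustering works. Agglomerative clustering starts with every point as its own cluster and iteratively merges clusters according to a linkage measure, so adding new data points can change the final clusters.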

...
cluster_model.fit_predict(distance_matrix)

df['label'] = cluster_model.labels_

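Check out fit_predict, which is the more relevant method for unsupervised or transductive estimators: it fits the model and returns the labels of the data it was fitted on.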
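To make the point about transductive estimators concrete, here is a minimal sketch. It uses random toy points rather than the question's data (an assumption made only to keep the example short), and it brings in KMeans purely as a contrast; KMeans is not part of the original answer and does not accept a precomputed distance matrix.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# toy data, only for illustration (not the iris data from the question)
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))

agglo = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agglo.fit_predict(points)

# fit_predict returns the labels of the rows the model was fitted on
same_as_labels = bool(np.array_equal(labels, agglo.labels_))

# a transductive estimator has no predict method for unseen rows,
# whereas an inductive one such as KMeans does
agglo_has_predict = hasattr(agglo, "predict")
kmeans_has_predict = hasattr(KMeans(n_clusters=3, n_init=10), "predict")

print(same_as_labels, agglo_has_predict, kmeans_has_predict)
```

So with AgglomerativeClustering, labels are only available for the rows that were part of the fit.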

Output (df after assigning the labels):

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target     target_name  iris_setosa  iris_versicolor  iris_virginica  label
0                  5.1               3.5                1.4               0.2       0     iris-setosa            1                0               0      1
1                  4.9               3.0                1.4               0.2       0     iris-setosa            1                0               0      1
2                  4.7               3.2                1.3               0.2       0     iris-setosa            1                0               0      1
3                  4.6               3.1                1.5               0.2       0     iris-setosa            1                0               0      1
4                  5.0               3.6                1.4               0.2       0     iris-setosa            1                0               0      1
..                 ...               ...                ...               ...     ...             ...          ...              ...             ...    ...
145                6.7               3.0                5.2               2.3       2  iris-virginica            0                0               1      2
146                6.3               2.5                5.0               1.9       2  iris-virginica            0                0               1      2
147                6.5               3.0                5.2               2.0       2  iris-virginica            0                0               1      2
148                6.2               3.4                5.4               2.3       2  iris-virginica            0                0               1      2
149                5.9               3.0                5.1               1.8       2  iris-virginica            0                0               1      2
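If refitting for every new row is not an option, one possible workaround (not part of the answer above, and only an approximation) is to keep the clusters fixed and assign each new row to the cluster whose training members are closest on average in the random forest leaf distance. The sketch below assumes a simplified feature set (three iris measurements, no one-hot species columns) and holds out every fifth row to play the role of new data. The helpers `leaf_distance` and `assign_to_nearest_cluster` are hypothetical names introduced here, not scikit-learn functions.

```python
import inspect

import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestRegressor


def leaf_distance(leaves_a, leaves_b):
    """1 - fraction of trees in which a row of A and a row of B share a leaf."""
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return 1.0 - same_leaf.mean(axis=2)


def assign_to_nearest_cluster(dist_to_train, train_labels):
    """Hypothetical helper: pick the cluster with the smallest mean distance."""
    cluster_ids = np.unique(train_labels)
    mean_dist = np.column_stack(
        [dist_to_train[:, train_labels == c].mean(axis=1) for c in cluster_ids]
    )
    return cluster_ids[mean_dist.argmin(axis=1)]


iris = datasets.load_iris()
X, y = iris.data[:, :3], iris.data[:, 3]  # petal width is the regression target

# assumption: every fifth row is treated as unseen "new" data
new_mask = np.arange(len(X)) % 5 == 0
X_train, y_train, X_new = X[~new_mask], y[~new_mask], X[new_mask]

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)
forest.fit(X_train, y_train)
train_leaves = forest.apply(X_train)  # shape (n_train, n_trees)
new_leaves = forest.apply(X_new)      # shape (n_new, n_trees)

# the precomputed-distance keyword is `metric` in recent scikit-learn, `affinity` in older releases
params = inspect.signature(AgglomerativeClustering).parameters
precomputed_kw = "metric" if "metric" in params else "affinity"
cluster_model = AgglomerativeClustering(
    n_clusters=3, linkage="average", **{precomputed_kw: "precomputed"}
)
train_labels = cluster_model.fit_predict(leaf_distance(train_leaves, train_leaves))

# new rows need only their distances to the training rows, not a refit
new_labels = assign_to_nearest_cluster(
    leaf_distance(new_leaves, train_leaves), train_labels
)
print(new_labels)
```

Using the mean distance mirrors the average linkage chosen in the question, but the result is not guaranteed to match what a full refit including the new rows would produce.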
