Python clustering package that can cluster on a distance matrix, but can also predict new rows (without a new clustering / distance matrix)


I am aware of various (sklearn) clustering algorithms that work with distance matrices - for example, a distance matrix produced from the proximity matrix of a random forest (some slightly clumsy reproducible code below). Is there any clustering algorithm (working with a distance matrix) where the fitted cluster model (e.g. cluster_model below) can produce the cluster membership of a new data row?

from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets

def distanceMatrix(model, X, normalize=True):
    # proximity-based distance: 1 - (fraction of trees in which two rows
    # end up in the same terminal leaf)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    a = terminals[:,0]
    proxMat = 1 * np.equal.outer(a, a)

    for i in range(1, nTrees):
        a = terminals[:,i]
        proxMat += 1*np.equal.outer(a, a)

    if normalize:
        proxMat = proxMat / nTrees

    return 1 - proxMat  

# use iris data to make example reproducible and fast
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['target'] = pd.Series(iris['target'], name = 'target_values')
df['target_name'] = df['target'].replace([0,1,2], ['iris-' + species for species in iris['target_names'].tolist()])

# simple one hot
df['iris_setosa'] = (df['target_name'] == 'iris-setosa').astype(int)
df['iris_versicolor'] = (df['target_name'] == 'iris-versicolor').astype(int)
df['iris_virginica'] = (df['target_name'] == 'iris-virginica').astype(int)

# the new regression model "target"
y = df['petal width (cm)']

X = df.drop([
    'target'
    ,'target_name'
    ,'petal width (cm)'
], axis = 1)

# fit a random forest just for the purpose of getting a proximity matrix
# open question: does it matter which target is picked and/or whether regression or classification is used?
# this is just to produce a toy dataset with mixed data
overfitted_model = RandomForestRegressor(n_estimators=250, min_samples_leaf=10)
overfitted_model.fit(X, y)

distance_matrix = distanceMatrix(overfitted_model, X, normalize=True)

# note: in newer scikit-learn releases the 'affinity' parameter has been renamed to 'metric'
cluster_model = AgglomerativeClustering(n_clusters=3, affinity='precomputed', linkage='average')
cluster_model.fit(distance_matrix)

df['label'] = cluster_model.labels_
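
For context, the fitted AgglomerativeClustering object above is transductive: it exposes fit and fit_predict but no predict method, so there is no built-in way to score a new row against the already-fitted clusters. A quick check:

# AgglomerativeClustering only assigns labels to the data it was fitted on;
# there is no predict() for unseen rows
print(hasattr(cluster_model, "predict"))      # False
print(hasattr(cluster_model, "fit_predict"))  # True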


PS:

  • Readers may find this interesting in this context.


Comments (1)

煮酒 2025-02-18 16:15:44

For Agglomerative Clustering, adding additional data points requires recomputing the clusters, because of how this type of clustering works: it iteratively builds clusters from starting points and then merges them according to a linkage measure, so adding a new data point can and will modify the final clusters.

...
cluster_model.fit_predict(distance_matrix)

df['label'] = cluster_model.labels_

Check out fit_predict, which is more relevant for unsupervised or transductive estimators.

Output:

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target     target_name  iris_setosa  iris_versicolor  iris_virginica  label
0                  5.1               3.5                1.4               0.2       0     iris-setosa            1                0               0      1
1                  4.9               3.0                1.4               0.2       0     iris-setosa            1                0               0      1
2                  4.7               3.2                1.3               0.2       0     iris-setosa            1                0               0      1
3                  4.6               3.1                1.5               0.2       0     iris-setosa            1                0               0      1
4                  5.0               3.6                1.4               0.2       0     iris-setosa            1                0               0      1
..                 ...               ...                ...               ...     ...             ...          ...              ...             ...    ...
145                6.7               3.0                5.2               2.3       2  iris-virginica            0                0               1      2
146                6.3               2.5                5.0               1.9       2  iris-virginica            0                0               1      2
147                6.5               3.0                5.2               2.0       2  iris-virginica            0                0               1      2
148                6.2               3.4                5.4               2.3       2  iris-virginica            0                0               1      2
149                5.9               3.0                5.1               1.8       2  iris-virginica            0                0               1      2
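
To get a label for a genuinely new row with this approach, the distance matrix has to be extended with that row and the clustering refitted, exactly as described above. A rough sketch of that recompute, where new_X is a hypothetical new observation with the same feature columns as X (note that cluster ids are not guaranteed to stay stable between refits):

# hypothetical new row with the same columns as X (here just a copy of row 0)
new_X = X.iloc[[0]].copy()

# recompute the random-forest proximity distance on the extended data
X_extended = pd.concat([X, new_X], ignore_index=True)
distance_matrix_ext = distanceMatrix(overfitted_model, X_extended, normalize=True)

# refit on the extended matrix; the last label belongs to the new row
# (labels of the original rows may be renumbered relative to the first fit)
labels_ext = cluster_model.fit_predict(distance_matrix_ext)
new_row_label = labels_ext[-1]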
