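Python clustering package that can cluster based on a distance matrix, but can also predict new rows (without a new clustering / distance matrix)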
I am aware of various (sklearn) clustering algorithms that work with distance matrices - e.g. one produced from the proximity matrix of a random forest (some clumsy reproducible code below). Is there any clustering algorithm (working with a distance matrix) where the fitted cluster model (e.g. cluster_model below) can produce the cluster membership of a new data row?
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets

def distanceMatrix(model, X, normalize=True):
    # leaf index of every row in every tree, shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    # proximity = number of trees in which two rows land in the same leaf
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)
    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)

    if normalize:
        proxMat = proxMat / nTrees

    # note: this is only a proper distance in [0, 1] when normalize=True
    return 1 - proxMat

# use iris data to make example reproducible and fast
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = pd.Series(iris['target'], name='target_values')
df['target_name'] = df['target'].replace([0, 1, 2], ['iris-' + species for species in iris['target_names'].tolist()])

# simple one hot
df['iris_setosa'] = (df['target_name'] == 'iris-setosa').astype(int)
df['iris_versicolor'] = (df['target_name'] == 'iris-versicolor').astype(int)
df['iris_virginica'] = (df['target_name'] == 'iris-virginica').astype(int)

# the new regression model "target"
y = df['petal width (cm)']
X = df.drop(['target', 'target_name', 'petal width (cm)'], axis=1)

# fit random forest just for the purpose of getting proximity matrix
# open question: does it matter which target is picked and/or whether regression or classification?
# this is just to produce a toy dataset with mixed data
overfitted_model = RandomForestRegressor(n_estimators=250, min_samples_leaf=10)
overfitted_model.fit(X, y)

distance_matrix = distanceMatrix(overfitted_model, X, normalize=True)

# scikit-learn >= 1.2 uses metric='precomputed'; older releases call this parameter affinity
cluster_model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
cluster_model.fit(distance_matrix)
df['label'] = cluster_model.labels_
PS:
- Readers may find this interesting in this context.
Comments (1)
For Agglomerative Clustering, adding new data points requires recomputing the clusters because of how this type of clustering works. Agglomerative Clustering starts from the individual points and iteratively merges clusters according to a linkage measure, so adding new data points can and will modify the final clusters. This is why AgglomerativeClustering offers fit and fit_predict but no separate predict method.
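To make the point concrete, here is a minimal sketch on made-up 1-D data (the numbers are invented for illustration and are not from the question). It uses the default Ward linkage on raw features rather than a precomputed matrix, purely to keep it short. It checks that the estimator has no predict method, and that a single extra row can change which of the original rows end up together once everything is re-clustered.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 1-D data, invented purely for illustration.
points = np.array([[0.0], [1.0], [5.0], [6.0]])

# AgglomerativeClustering exposes fit and fit_predict, but no predict method.
has_predict = hasattr(AgglomerativeClustering(n_clusters=2), "predict")

# Two clusters on the original four rows: {0, 1} and {5, 6}.
labels_before = AgglomerativeClustering(n_clusters=2).fit_predict(points)

# Add one far-away row and re-cluster everything from scratch.
points_plus_new = np.vstack([points, [[100.0]]])
labels_after = AgglomerativeClustering(n_clusters=2).fit_predict(points_plus_new)

print(has_predict)                           # False
print(labels_before[0] == labels_before[2])  # False: 0 and 5 are in different clusters
print(labels_after[0] == labels_after[2])    # True: with 100 added, 0 and 5 are merged
```

With two clusters requested, the new row at 100 takes a cluster for itself, so the four original rows collapse into one group. The labels of the old rows changed even though the old rows did not.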
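Check out fit_predict, which is more relevant for unsupervised or transductive estimators: it clusters exactly the rows it is given and returns their labels, so labelling new rows means calling it again on the combined data.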
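If re-clustering for every new row is not acceptable, one possible workaround is to keep the fitted clusters fixed and assign each new row to the nearest existing cluster. This is a sketch under assumptions, not part of the answer above and not something scikit-learn provides for AgglomerativeClustering: the helpers cross_distance and assign_to_clusters are hypothetical names, the 120/30 split is arbitrary, and "nearest" is taken to mean the smallest mean forest distance to a cluster's training members, chosen to mirror average linkage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor


def cross_distance(model, X_a, X_b):
    """Forest distance between rows of X_a and rows of X_b:
    1 - (fraction of trees in which the two rows fall in the same leaf)."""
    leaves_a = model.apply(X_a)  # shape (n_a, n_trees)
    leaves_b = model.apply(X_b)  # shape (n_b, n_trees)
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return 1.0 - same_leaf.mean(axis=2)


def assign_to_clusters(dist_to_train, train_labels):
    """Hypothetical helper: for each new row, pick the cluster whose
    training members are closest on average."""
    cluster_ids = np.unique(train_labels)
    mean_dist = np.column_stack(
        [dist_to_train[:, train_labels == c].mean(axis=1) for c in cluster_ids]
    )
    return cluster_ids[mean_dist.argmin(axis=1)]


# Hold out 30 iris rows to play the role of "new" data (arbitrary split).
data = load_iris().data
order = np.random.RandomState(0).permutation(len(data))
train, new = data[order[:120]], data[order[120:]]
X_train, y_train = train[:, :3], train[:, 3]  # petal width is the target
X_new = new[:, :3]

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)
forest.fit(X_train, y_train)

train_dist = cross_distance(forest, X_train, X_train)
try:  # scikit-learn >= 1.2 names the parameter "metric"
    cluster_model = AgglomerativeClustering(
        n_clusters=3, metric="precomputed", linkage="average")
except TypeError:  # older releases call it "affinity"
    cluster_model = AgglomerativeClustering(
        n_clusters=3, affinity="precomputed", linkage="average")
train_labels = cluster_model.fit_predict(train_dist)

# New rows only need their distances to the training rows; nothing is refitted.
new_labels = assign_to_clusters(cross_distance(forest, X_new, X_train), train_labels)
print(new_labels)
```

Keep in mind that labels assigned this way are an approximation: re-running the clustering on the combined data could give a different partition, for the reason described in the answer.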