Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
Just use nltk instead, where you can do this, e.g.:
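A minimal sketch of that route (the answer's own example did not survive extraction): nltk's KMeansClusterer accepts an arbitrary distance function, any f(u, v) -> float; cosine_distance is just the stock choice.

```python
import numpy as np
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

data = np.random.rand(50, 4)   # toy data: 50 points in 4 dimensions

# Any callable taking two vectors and returning a float works as `distance`.
clusterer = KMeansClusterer(3, distance=cosine_distance, repeats=10,
                            avoid_empty_clusters=True)
labels = clusterer.cluster(data, assign_clusters=True)
print(labels[:10])
print(clusterer.means())       # the learned cluster means
```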
Yes, you can use a different metric function; however, by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster. You could use a different metric, so even though you are still calculating the mean, you could use something like the Mahalanobis distance.
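To illustrate the point, a toy sketch (not a full clustering loop): the centroid is still an ordinary coordinate mean, while the distance to it is Mahalanobis, via scipy.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(100, 3)
mean = X.mean(axis=0)                         # the usual k-means centroid
VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
dists = np.array([mahalanobis(x, mean, VI) for x in X])
```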
There is pyclustering, which is Python/C++ (so it's fast!) and lets you specify a custom metric function. Actually, I haven't tested this code, but cobbled it together from a ticket and example code.
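In the same untested spirit, a sketch pieced together from pyclustering's documented API: wrap any f(a, b) -> float in distance_metric(type_metric.USER_DEFINED, ...) and hand it to kmeans.

```python
import numpy as np
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils.metric import distance_metric, type_metric

def manhattan(a, b):
    # Any point-to-point distance function works here.
    return np.abs(np.asarray(a) - np.asarray(b)).sum()

sample = np.random.rand(100, 2).tolist()
metric = distance_metric(type_metric.USER_DEFINED, func=manhattan)

initial_centers = kmeans_plusplus_initializer(sample, 3).initialize()
instance = kmeans(sample, initial_centers, metric=metric)
instance.process()
clusters = instance.get_clusters()   # lists of point indices, one per cluster
```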
The k-means of Spectral Python allows the use of the L1 (Manhattan) distance.
Sklearn's KMeans uses the Euclidean distance; it has no metric parameter. That said, if you're clustering time series, you can use the tslearn Python package, where you can specify a metric (dtw, softdtw, euclidean).
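A small sketch, assuming univariate series in the (n_series, length, 1) layout tslearn expects:

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

X = np.random.rand(40, 50, 1)   # 40 univariate series of length 50

# metric can be "euclidean", "dtw", or "softdtw".
model = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = model.fit_predict(X)
```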
The Affinity Propagation algorithm from the sklearn library allows you to pass a similarity matrix instead of the samples. So, you can use your metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to the function by setting the "affinity" term to "precomputed": https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit

In terms of K-Means, I think it is also possible, but I have not tried it. However, as the other answers stated, finding the mean using a different metric will be the issue. Instead, you can use the PAM (K-Medoids) algorithm, as it calculates the change in Total Deviation (TD) and thus does not rely on the distance metric: https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
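A sketch of both routes, assuming a Manhattan-style metric is wanted: a negated cityblock matrix as the similarity for AffinityPropagation, and the raw dissimilarity matrix for python-kmedoids' FasterPAM.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AffinityPropagation

X = np.random.rand(60, 4)

# Affinity Propagation wants similarities: negate the distances so that
# larger values mean "more similar".
S = -cdist(X, X, metric="cityblock")
ap = AffinityPropagation(affinity="precomputed", random_state=0)
ap_labels = ap.fit_predict(S)

# The K-Medoids route takes the dissimilarity matrix directly
# (pip install kmedoids).
import kmedoids
D = cdist(X, X, metric="cityblock")
result = kmedoids.fasterpam(D, 3)   # find 3 medoids
km_labels = result.labels
```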
Yes, in the current stable version of sklearn (scikit-learn 1.1.3), you can easily use your own distance metric. All you have to do is create a class that inherits from sklearn.cluster.KMeans and overrides its _transform method. The example below is for the IOU distance from the Yolov2 paper.
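The code referenced by "the example below" was lost in extraction; here is a sketch of what the answer describes, with IOU computed between (width, height) box pairs as in YOLOv2 anchor clustering. Note that _transform is a private method, and overriding it changes what transform() returns; the internal Lloyd fitting loop is untouched.

```python
import numpy as np
from sklearn.cluster import KMeans

class IOUKMeans(KMeans):
    """KMeans whose transform() reports 1 - IOU to each cluster center.

    Points and centers are (width, height) pairs; boxes are assumed to
    share a common top-left corner, as in YOLOv2 anchor clustering.
    """

    def _transform(self, X):
        # Intersection of each box with each cluster center's box.
        inter = (np.minimum(X[:, None, 0], self.cluster_centers_[None, :, 0])
                 * np.minimum(X[:, None, 1], self.cluster_centers_[None, :, 1]))
        union = (X[:, 0] * X[:, 1])[:, None] \
            + (self.cluster_centers_[:, 0] * self.cluster_centers_[:, 1])[None, :] \
            - inter
        return 1.0 - inter / union
```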
As of version scikit-learn==1.2.2, one could replace _euclidean_distances in sklearn.cluster._kmeans with the following, then create the base estimator as usual:
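The snippet itself did not survive extraction; below is a hedged sketch of the kind of monkey-patch the answer describes, swapping in a hypothetical Manhattan replacement with the same signature as the private helper. Since _euclidean_distances is private API, this can break between releases.

```python
import numpy as np
import sklearn.cluster._kmeans as kmeans_module

def _manhattan_distances(X, Y, X_norm_squared=None, Y_norm_squared=None,
                         squared=False):
    # Same signature as sklearn's private _euclidean_distances helper;
    # the *_norm_squared arguments are accepted but unused here.
    d = np.abs(np.asarray(X)[:, None, :] - np.asarray(Y)[None, :, :]).sum(axis=-1)
    return d ** 2 if squared else d

# Rebind the name the k-means module imported (k-means++ init, for one,
# goes through it).
kmeans_module._euclidean_distances = _manhattan_distances

# Then create the base estimator as usual:
# from sklearn.cluster import KMeans
# KMeans(n_clusters=3, n_init=10).fit(X)
```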
Here's a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function (a sketch follows the notes below).
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric?

Some notes added 26 March 2012:

1) For cosine distance, first normalize all the data vectors to |X| = 1; then cosine distance reduces to a dot product and is fast. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say 1% of N, X . Y should take time O(2% N), space O(N); but I don't know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch k-means, etc., with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you're expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % ... (sound of head-scratching).
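The code block this answer points to did not survive extraction; below is a minimal sketch in its spirit: a Lloyd-style loop whose assignment step goes through scipy.spatial.distance.cdist (any of its metrics, or a user callable), while centres are still plain coordinate means. The name kmeanssample is illustrative, not necessarily the answer's original.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeanssample(X, k, metric="euclidean", maxiter=100, seed=0):
    """Lloyd-style kmeans on an (N, dim) float array X; `metric` may be
    any cdist metric name or a callable f(u, v) -> float."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(maxiter):
        # Assign: nearest centre under the chosen metric.
        labels = cdist(X, centres, metric=metric).argmin(axis=1)
        # Update: centres remain coordinate means (keep old centre if empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres, labels
```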
Unfortunately, no: scikit-learn's current implementation of k-means only uses Euclidean distances.
It is not trivial to extend k-means to other distances, and Denis' answer above is not the correct way to implement k-means for other metrics.