How can I use a clustering algorithm (e.g. DBSCAN) while preventing two points from ending up in the same cluster if their distance in one dimension exceeds a threshold?

Posted on 2025-02-13 04:01:27


I am trying to cluster a large dataset with the columns time, xCoordinate and yCoordinate.
I want my final clusters to have two properties:

  1. When two points are too far apart time-wise (e.g. more than 10 time periods between them), I don't want them to be in the same cluster.
  2. Apart from the time restriction (1.), I want to apply normal clustering on the geographical values: so if I use e.g. DBSCAN, I would define an appropriate epsilon value that is used for clustering on the geographical coordinates (the combined rule is sketched right after this list).
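
Conceptually, the combined distance I have in mind looks like this (a minimal sketch purely to illustrate the two properties; the function name and the max_time_gap parameter are just illustrative):

import numpy as np

def constrained_distance(p, q, max_time_gap=10):
    """p and q are [time, x, y] rows: return the geographic distance
    if the time gap is small enough, otherwise infinity so the pair
    can never be closer than any epsilon."""
    if abs(p[0] - q[0]) > max_time_gap:
        return np.inf
    return np.hypot(p[1] - q[1], p[2] - q[2])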

My approach so far was to first compute two distance vectors, one using only the time value and one using only the geographical values:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

arr = np.array([[0,1,1], [0,3,3], [0,1,2],...])  # each row is [time, x, y]
n, m = arr.shape

time_dist = pdist(arr[:, 0].reshape(n, 1), metric='euclidean')  # pairwise time differences, condensed form
euc_dist = pdist(arr[:, 1:], metric='euclidean')  # pairwise geographic distances, condensed form
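
Note that both pdist calls already return condensed distance vectors with one entry per unordered pair of points, so this intermediate step itself scales quadratically:

# Each condensed vector has n*(n-1)/2 entries, one per unordered pair.
assert time_dist.shape == euc_dist.shape == (n * (n - 1) // 2,)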

Then I filtered out all point pairs whose time distance exceeds the specified threshold (e.g. 10) and set their geographical distance to 2*epsilon, so that they cannot end up in the same cluster later on:

epsilon = 3  # clustering radius for the geographic values (same value passed to DBSCAN below)
dist = np.where(time_dist <= 10, euc_dist, 2 * epsilon)  # pairs with time difference > 10 get pushed beyond epsilon
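
Any finite value strictly larger than the eps passed to DBSCAN would work here just as well; 2 * epsilon simply leaves a comfortable margin above the radius.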

Finally I built the square distance matrix from the condensed vector and passed it as input to a clustering algorithm from sklearn.cluster, for example DBSCAN:

X = squareform(dist)  # expand the condensed vector into a dense (n, n) distance matrix
clustering = DBSCAN(eps=3, metric='precomputed').fit(X)  # 'precomputed' so X is read as distances, not features
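
The resulting cluster assignment (one label per point, noise marked as -1) can then be read from the fitted estimator:

labels = clustering.labels_  # array of shape (n,), -1 means noise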

This last step is the problem: building the square distance matrix uses a huge amount of memory for a large number of data points and therefore hurts performance badly.
Since the only two input formats the sklearn clustering algorithms (such as DBSCAN) accept are a precomputed distance matrix in square form, as described above, or a plain vector array of the data points (which I can't use, since I want to enforce the time restriction), I don't know how to make the clustering more memory-efficient.
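
To put the memory problem in numbers, a rough estimate assuming 64-bit floats and, say, 100,000 points (an illustrative size, not my actual dataset):

n_points = 100_000                # illustrative size only
dense_bytes = n_points ** 2 * 8   # squareform output: n x n float64 values
print(dense_bytes / 1e9, "GB")    # roughly 80 GB for the dense matrix alone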

Thanks a lot in advance for any help / recommendations!
