Out-of-memory error when using clusterdata in MATLAB

Posted 2024-09-04 01:29:13


I am trying to cluster a matrix (size: 20057x2):

T = clusterdata(X,cutoff);

but I get this error:

??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.

Error in ==> pdist at 211
    Y = pdistmex(X',dist,additionalArg);

Error in ==> linkage at 139
       Z = linkagemex(Y,method,pdistArg);

Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);

Error in ==> kmeansTest at 2
T = clusterdata(X,1);

Can someone help me? I have 4 GB of RAM, but I think the problem lies elsewhere.


Comments (3)

少钕鈤記 2024-09-11 01:29:13


As mentioned by others, hierarchical clustering needs to compute the pairwise distance matrix, which in your case is too large to fit in memory.

Try using the K-Means algorithm instead:

numClusters = 4;
T = kmeans(X, numClusters);

Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected into the subset, simply compute its distance to each centroid and assign it to the closest one.

Here is sample code illustrating the idea:

%# random data
X = rand(25000, 2);

%# pick a subset
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);

%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3));      %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) );       %# number of clusters found

%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])

%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight

%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end

%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);

%#clustIDX( ind(1:SUBSET_SIZE) ) = C;

%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight

[figure: dendrogram of the subset clustering]
[figure: subset and full data scatter plots colored by cluster]
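For reference, the same subset-then-assign idea can be sketched outside MATLAB as well. The following is an illustrative Python version using NumPy/SciPy (the subset size, cutoff factor, and random seed are arbitrary choices, not values from the answer above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, cdist

rng = np.random.default_rng(0)
X = rng.random((25000, 2))                 # full data set (random demo data)

# hierarchically cluster only a random subset
idx = rng.permutation(len(X))[:1000]
data = X[idx]
Z = linkage(pdist(data, 'euclidean'), method='ward')
cutoff = 0.6 * Z[:, 2].max()
C = fcluster(Z, t=cutoff, criterion='distance')   # labels start at 1
K = len(np.unique(C))

# cluster centers = mean of each subset cluster
centers = np.vstack([data[C == k].mean(axis=0) for k in range(1, K + 1)])

# assign every instance in the full data set to its nearest center
labels = cdist(X, centers).argmin(axis=1) + 1
```

Only the 1000x1000 subset distances are ever materialized; the final assignment needs just an N-by-K distance matrix, which stays small.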

杀手六號 2024-09-11 01:29:13

X 太大,无法在 32 位机器上执行。 pdist 正在尝试创建一个 201,131,596 双精度行向量(clusterdata 使用 pdist),这将占用大约 1609MB (double< /code> 是 8 个字节)...如果您在 Windows 下使用 /3GB 开关运行它,您的最大矩阵大小将被限制为 1536MB(请参阅 此处)。

您需要以某种方式划分数据,而不是一次性直接对所有数据进行聚类。

X is too big to process on a 32-bit machine. pdist (which clusterdata calls internally) is trying to create a 201,131,596-element row vector of doubles, which would take up about 1609 MB (a double is 8 bytes). Even if you run under Windows with the /3GB switch, you are limited to a maximum matrix size of 1536 MB (see here).

You're going to need to divide up the data some way instead of directly clustering all of it in one go.
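To see where those figures come from: pdist returns one distance per unordered pair of rows, N*(N-1)/2 values in total. A quick back-of-the-envelope check (written in Python here, purely for illustration):

```python
# Estimate the memory pdist needs for its condensed distance vector.
n_rows = 20057                          # rows in X
n_pairs = n_rows * (n_rows - 1) // 2    # one distance per unordered pair
bytes_needed = n_pairs * 8              # MATLAB doubles are 8 bytes each

print(n_pairs)                          # 201131596
print(round(bytes_needed / 1e6))        # 1609 (MB)
```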

神回复 2024-09-11 01:29:13

PDIST calculates the distance between every possible pair of rows. If your data contains N = 20057 rows, the number of pairs is N*(N-1)/2, which is 201,131,596 in your case. That is likely too much for your machine.
