加速巨大稀疏矩阵的条件填充
我想知道是否有一种方法可以加速(也许通过矢量化?)巨大稀疏矩阵(例如〜1e10 x 1e10)的条件填充。这是示例代码,其中我有一个嵌套循环,并且仅在满足特定条件时才填充稀疏矩阵:
% We are given the following cell arrays of the same size:
% all_arrays_1
% all_arrays_2
% all_mapping_arrays
N = 1e10;
% The number of nnz (non-zeros) is unknown until the loop finishes
huge_sparse_matrix = sparse([],[],[],N,N);
n_iterations = numel(all_arrays_1);
for iteration=1:n_iterations
array_1 = all_arrays_1{iteration};
array_2 = all_arrays_2{iteration};
mapping_array = all_mapping_arrays{iteration};
n_elements_in_array_1 = numel(array_1);
n_elements_in_array_2 = numel(array_2);
for element_1 = 1:n_elements_in_array_1
element_2 = mapping_array(element_1);
% Sanity check:
if element_2 <= n_elements_in_array_2
item_1 = array_1(element_1);
item_2 = array_2(element_2);
huge_sparse_matrix(item_1,item_2) = 1;
end
end
end
我正在努力对嵌套循环进行矢量化。据我了解,当要填充的条目数量很大(~100M)时,逐个元素填充稀疏矩阵的速度非常慢。我需要使用稀疏矩阵,因为它的尺寸在 10,000M x 10,000M 范围内。然而,这种在 MATLAB 中填充稀疏矩阵的方法非常慢。
编辑:
我更新了变量的名称,以更好地反映它们的性质。没有函数调用。
附录:
此代码为一个巨大的图构建矩阵邻接。变量all_mapping_arrays
以本地表示的方式保存图形节点之间的映射数组(〜邻接关系),这就是为什么我需要array_1
和< code>array_2 将邻接映射到全局表示。
I was wondering if there is a way of speeding up (maybe via vectorization?) the conditional filling of huge sparse matrices (e.g. ~ 1e10 x 1e10). Here's the sample code where I have a nested loop, and I fill in a sparse matrix only if a certain condition is met:
% We are given the following cell arrays of the same size:
% all_arrays_1
% all_arrays_2
% all_mapping_arrays
N = 1e10;
% The number of nnz (non-zeros) is unknown until the loop finishes
huge_sparse_matrix = sparse([],[],[],N,N);
n_iterations = numel(all_arrays_1);
for iteration=1:n_iterations
array_1 = all_arrays_1{iteration};
array_2 = all_arrays_2{iteration};
mapping_array = all_mapping_arrays{iteration};
n_elements_in_array_1 = numel(array_1);
n_elements_in_array_2 = numel(array_2);
for element_1 = 1:n_elements_in_array_1
element_2 = mapping_array(element_1);
% Sanity check:
if element_2 <= n_elements_in_array_2
item_1 = array_1(element_1);
item_2 = array_2(element_2);
huge_sparse_matrix(item_1,item_2) = 1;
end
end
end
I am struggling to vectorize the nested loop. As far as I understand the filling a sparse matrix element by element is very slow when the number of entries to fill is large (~100M). I need to work with a sparse matrix since it has dimensions in the 10,000M x 10,000M range. However, this way of filling a sparse matrix in MATLAB is very slow.
Edits:
I have updated the names of the variables to reflect their nature better. There are no function calls.
Addendum:
This code builds the matrix adjacency for a huge graph. The variable all_mapping_arrays
holds mapping arrays (~ adjacency relationship) between nodes of the graph in a local representation, which is why I need array_1
and array_2
to map the adjacency to a global representation.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我认为这将是稀疏矩阵的增量更新,而不是基于循环的条件,这将减慢速度。
当您通过 A(i,j) = 1 之类的方式向稀疏矩阵添加新条目时,通常需要重新打包整个矩阵数据结构。这是一项昂贵的操作。如果您感兴趣,MATLAB 在内部使用
CCS
数据结构(压缩列存储),这在数据结构部分 此处。注意声明:,最好(更快)地将矩阵中的非零条目累积为一组三元组,然后对
sparse
进行一次调用。例如(警告 - 大脑编译的代码!):我一次分配
N
条目的块。假设您对矩阵的结构了解很多,您可能可以在此处使用更好的值。希望这有帮助。
I think it will be the incremental update of the sparse matrix, rather than the loop based conditional that will be slowing things down.
When you add a new entry to a sparse matrix via something like
A(i,j) = 1
it typically requires that the whole matrix data structure is re-packed. The is an expensive operation. If you're interested, MATLAB uses aCCS
data structure (compressed column storage) internally, which is described under the Data Structure section here. Note the statement:Generally, it's far better (faster) to accumulate the non-zero entries in the matrix as a set of triplets and then make a single call to
sparse
. For example (warning - brain compiled code!!):I'm allocating in chunks of
N
entries at a time. Assuming that you know alot about the structure of your matrix you might be able to use a better value here.Hope this helps.