Can the following loop be vectorized?
I have a for-loop which performs the following function:
Take an M-by-8 matrix and:
- Split it into blocks of 512 elements (meaning X rows by 8 columns with X * 8 == 512; the number of elements per block can be 128, 256, 512, 1024, or 2048)
- Reshape each block into a 1-by-512 (number of elements) matrix.
- Take the last 1/4 of the block and put it in front,
e.g. Data = [Data(1,385:512),Data(1,1:384)];
The following is my code:
for i = 1 : NumOfBlock
    if i == 1
        Header = tempHeader(1:RowNeeded,:);
        Header = reshape(Header,1,BlockSize); %BS
        Header = [Header(1,385:512),Header(1,1:384)]; %CP
        Data = tempData(1:RowNeeded,:);
        Data = reshape(Data,1,BlockSize); %BS
        Data = [Data(1,385:512),Data(1,1:384)]; %CP
        start = RowNeeded + 1;
        end1 = RowNeeded * 2;
    else
        temp = tempData(start:end1,:);
        temp = reshape(temp,1,BlockSize); %BS
        temp = [temp(1,385:512),temp(1,1:384)]; %CP
        Data = [Data, temp];
    end
    if i <= 127 && i > 1
        temp = tempHeader(start:end1,:);
        temp = reshape(temp,1,BlockSize); %BS
        temp = [temp(1,385:512),temp(1,1:384)]; %CP
        Header = [Header, temp];
    end
    start = end1 + 1;
    end1 = end1 + RowNeeded;
end
Running this loop on 5 million elements takes more than 1 hour. I need it to be as fast as possible (a matter of seconds). Can this loop be vectorized?
Comments (4)
Based on your function description, here's what I came up with:
This diagram might help in understanding the above:
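For illustration, here is a minimal sketch of one way the block-wise reshape and rotation could be done without a loop. It is not necessarily the code this answer originally contained. The example sizes and the names A and cp are assumptions made for the sketch; tempData, RowNeeded, BlockSize, NumOfBlock and Data follow the question.

```matlab
% Hypothetical small input: 2 blocks of 512 elements in an M-by-8 matrix.
BlockSize  = 512;                       % elements per block
RowNeeded  = BlockSize / 8;             % rows of the M-by-8 matrix per block
NumOfBlock = 2;
tempData   = reshape(1:NumOfBlock*BlockSize, NumOfBlock*RowNeeded, 8);

% Group the rows by block: A(r, b, c) = tempData((b-1)*RowNeeded + r, c)
A = reshape(tempData, RowNeeded, NumOfBlock, 8);
% Move the block index last, then flatten each block in column-major order,
% which matches reshape(block, 1, BlockSize) in the loop
A = permute(A, [1 3 2]);                % RowNeeded-by-8-by-NumOfBlock
A = reshape(A, BlockSize, NumOfBlock);  % one flattened block per column

% Move the last quarter of every block to the front
cp = BlockSize / 4;                     % 128 when BlockSize is 512
A  = A([BlockSize-cp+1:BlockSize, 1:BlockSize-cp], :);

% Concatenate the blocks into one row, as the loop does
Data = reshape(A, 1, []);
```

Header could presumably be built the same way from the first 127*RowNeeded rows of tempHeader, since the loop only appends header blocks while i <= 127.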
Vectorizing may or may not help. What will help is knowing where the bottleneck is. Use the profiler as outlined here:
http://blogs.mathworks.com/videos/2006/10/19/profiler-to-find-code-bottlenecks/
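A minimal sketch of what profiling such a loop might look like. The loop body here is only a stand-in for the question's code, and the sizes and the variable stats are hypothetical.

```matlab
% Small stand-in for the question's loop (sizes are hypothetical).
BlockSize  = 512;
NumOfBlock = 200;
Data = [];

profile on                          % start collecting timing data
for i = 1:NumOfBlock
    temp = rand(1, BlockSize);      % placeholder for the reshape/rotate work
    Data = [Data, temp];            % array grows on every iteration
end
profile off                         % stop collecting
stats = profile('info');            % struct holding the collected timings
```

In MATLAB, running profile viewer afterwards opens the report. In a loop like this one, the growing concatenation Data = [Data, temp] is a likely suspect, since the array has to be reallocated as it grows, but the report is what would confirm or rule that out.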
It would be nice if you told us what you are trying to do (my guess is some simulation of a dynamical system, but it's hard to tell).
Yes, of course it can be vectorized: each of your blocks is actually four sub-blocks; using your (extremely non-standard) indices:
1...128, 129...256, 257...384, 385...512
Every kernel/thread/whatever-you-call-it of the vectorization should do the following:
i = threadIdx, between 0 and 127
temp = data[1 + i]
data[1 + i] = data[385 + i]
data[385 + i] = data[257 + i]
data[257 + i] = data[129 + i]
data[129 + i] = temp
You should of course also parallelize over blocks, not only vectorize.
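The four-sub-block rotation above can be sketched in MATLAB for all blocks at once, assuming each block has already been flattened into one column. The names blocks, q, S and rotated, and the example sizes, are hypothetical.

```matlab
% Hypothetical input: each column is one already-flattened 512-element block.
BlockSize  = 512;
NumOfBlock = 3;
blocks = reshape(1:BlockSize*NumOfBlock, BlockSize, NumOfBlock);

q = BlockSize / 4;                       % sub-block length (128 here)
S = reshape(blocks, q, 4, NumOfBlock);   % S(:, k, b) is sub-block k of block b
S = S(:, [4 1 2 3], :);                  % new order: old 4th, 1st, 2nd, 3rd
rotated = reshape(S, BlockSize, NumOfBlock);
```

Reordering whole sub-blocks by index gives the same result as the per-element swap chain, without the temporary, and it handles every block in the same statement.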
Once again I would like to thank Amro for giving me an idea of how to solve my problem. Sorry for not making myself clear in the question.
Here is the solution to my problem: