Initialise two large 401x401 2D arrays and do fast matrix multiplication in C++
I want to initialise two 2D matrices of size 401x401 and multiply them quickly.
My first attempt, with two plain 2D arrays of double, failed, most probably because of a stack overflow, as described in this question: No output after new arrays being initialized in C++.
Following the suggestions there, I switched to vectors of vectors to store my 2D matrices. However, I need fast matrix multiplication, as time is a significant factor for me, and there were suggestions not to use vectors of vectors for this purpose: How can we speedup matrix multiplication where matrices are initialized using vectors of vectors (2D vector) in C++.
Maybe I could convert back to arrays, but I suspect that would cause a stack overflow again.
How can I both initialise two large matrices and keep the matrix multiplication fast? If I stick to vectors of vectors, I cannot use libraries such as MKL.
Comments (2)
This can be sped up considerably.
First, to eliminate the stack usage issue, declare the matrix on the heap in a single contiguous allocation.
This gives good cache locality, since all the matrix memory is contiguous.
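A minimal sketch of one such declaration, assuming a `std::vector` of fixed-size `std::array` rows (the `vector<array>` layout named in the timings further down). The names `N`, `Matrix` and `make_matrix` are illustrative and not taken from the original answer.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 401;  // matrix dimension from the question

// Each row is a fixed-size array. The vector keeps all rows back to back
// in one heap allocation, so nothing large is placed on the stack.
using Matrix = std::vector<std::array<double, N>>;

// Hypothetical helper: builds an N x N matrix with every element set to `value`.
Matrix make_matrix(double value) {
    std::array<double, N> row;  // one row only (about 3 KB), fine on the stack
    row.fill(value);
    return Matrix(N, row);      // N copies of that row, stored on the heap
}
```

Elements are still accessed as `m[i][j]`, so code written for a 2D array or a vector of vectors should need few changes.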
Next, transpose one of the matrices (the right-hand operand, for row-major storage). The inner loop of the multiplication then reads both operands sequentially in memory.
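The transposition step could look roughly like the sketch below. It assumes square matrices stored as a `std::vector` of `std::array` rows with exactly `N` rows; `Matrix`, `transpose` and `multiply_transposed` are hypothetical names, and the size is a template parameter only so the code can be tried on small inputs.

```cpp
#include <array>
#include <cstddef>
#include <vector>

template <std::size_t N>
using Matrix = std::vector<std::array<double, N>>;

// Returns the transpose of b: column j of b becomes row j of the result.
// Assumes b has exactly N rows.
template <std::size_t N>
Matrix<N> transpose(const Matrix<N>& b) {
    Matrix<N> bt(N);  // N zero-initialised rows
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            bt[j][i] = b[i][j];
    return bt;
}

// Computes c = a * b. Transposing b first means the inner loop walks
// a[i] and bt[j] front to back instead of striding down a column of b.
template <std::size_t N>
Matrix<N> multiply_transposed(const Matrix<N>& a, const Matrix<N>& b) {
    const Matrix<N> bt = transpose(b);
    Matrix<N> c(N);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                sum += a[i][k] * bt[j][k];
            c[i][j] = sum;
        }
    }
    return c;
}
```

The transpose costs one extra pass over the matrix, which is small next to the N^3 work of the multiplication itself.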
Finally, this lends itself to further speed-up by multithreading. For instance, create 4 threads that process 100, 100, 100 and 101 rows respectively. No thread synchronisation is required, since each thread writes only to its own rows. Just join them all at the end and you're done.
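A hedged sketch of that row split using `std::thread`. It assumes the right-hand matrix has already been transposed (passed in as `bt`) and that the matrices have exactly `N` rows; `multiply_rows` and `multiply_threaded` are hypothetical names. For `N = 401` and 4 threads the chunks come out as 100, 100, 100 and 101 rows.

```cpp
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

template <std::size_t N>
using Matrix = std::vector<std::array<double, N>>;

// Computes rows [row_begin, row_end) of c = a * b, where bt is b transposed.
template <std::size_t N>
void multiply_rows(const Matrix<N>& a, const Matrix<N>& bt, Matrix<N>& c,
                   std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                sum += a[i][k] * bt[j][k];
            c[i][j] = sum;
        }
    }
}

// Splits the rows of the result across num_threads workers.
// Each worker writes a disjoint range of rows, so no locking is needed.
template <std::size_t N>
Matrix<N> multiply_threaded(const Matrix<N>& a, const Matrix<N>& bt,
                            std::size_t num_threads = 4) {
    if (num_threads == 0) num_threads = 1;  // guard against division by zero
    Matrix<N> c(N);
    std::vector<std::thread> workers;
    const std::size_t chunk = N / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        // The last worker also takes the leftover rows.
        const std::size_t end = (t + 1 == num_threads) ? N : begin + chunk;
        workers.emplace_back([&a, &bt, &c, begin, end] {
            multiply_rows<N>(a, bt, c, begin, end);
        });
    }
    for (auto& w : workers) w.join();  // wait for every chunk to finish
    return c;
}
```

On some toolchains this needs the threading library linked explicitly (for example `-pthread` with GCC or Clang).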
I was curious and hacked together some code to time 4 different conditions, comparing a `vector<vector>` against a `vector<array>`.
You can get the memory on the heap via `new`. Repeated use will not really help here, as `new` is not a fast operation. You can skirt this problem by getting one large block of memory once and pretending it's 2D.
You could do similar things with memory managed by a vector, I think, or with smart pointers as well. This will of course not be as fast as allocating on the stack, but it will be passable.
With this approach you'll have to be careful about memory leaks, and remember that as soon as `raw_data` is freed, `matrix` is invalidated as well. Some of these problems are addressed by using a smart container for `raw_data`; the others you have to keep in mind.
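As one illustration of the smart-container route, the sketch below keeps the block in a `std::vector<double>` and computes the 2D index by hand, so there is no separate row-pointer array to invalidate and no manual `delete[]`. The class name `FlatMatrix` and its members are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a flat, heap-allocated buffer indexed as 2D.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Row-major indexing: element (i, j) lives at offset i * cols + j.
    double& operator()(std::size_t i, std::size_t j) {
        return data_[i * cols_ + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        return data_[i * cols_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Pointer to the contiguous storage, valid while the object is alive.
    double* data() { return data_.data(); }
    const double* data() const { return data_.data(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // freed automatically by the destructor
};
```

A contiguous row-major buffer like the one returned by `data()` is also the general shape of input that BLAS-style libraries such as MKL work with, which is not possible with a vector of vectors.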