Is cache blocking worse than the original code?
I am researching how to speed up matrix operations (on the double type) and have been trying techniques such as cache blocking and loop unrolling. Loop unrolling was really successful, but I cannot improve performance with blocking. I don't know whether that is because I am doing something wrong or because blocking is simply not useful in this case.
The original code, without any optimization, is:
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            d[i][j] = 0.0;
        }
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < K_MAX; k++) {
                d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
            }
        }
    }
Here K_MAX is always 8, and N takes the values 250, 500, 750, 1000, 1500, 2000, 2550, and 3000.
This is what I tried with blocking:
    for (int i = 0; i < N; i += block_size) {
        for (int j = 0; j < N; j += block_size) {
            for (int ii = i; ii < min(i + block_size, N); ii++) {
                for (int jj = j; jj < min(j + block_size, N); jj++) {
                    d[ii][jj] = 0.0;
                    for (int k = 0; k < K_MAX; k++) {
                        d[ii][jj] += 2 * a[ii][k] * (b[k][jj] - c[k]);
                    }
                }
            }
        }
    }
I am probably choosing a bad value for block_size, because I do not understand how to choose a good one. I tried every divisor of N as the block size, from 1 to N. I also tried multiples of the number of elements that fit in a cache line (8 doubles), such as 8, 64, 128, 256, and 512. I know N is not always a multiple of those values, so the leftover elements that a full block does not cover have to be handled; I did that correctly, since the outputs are right. Performance still did not improve. I also tried using the same block size for every value of N but, as you can guess, that did not help either.
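For reference, here is a minimal self-contained sketch of how the two versions can be checked against each other for a given block size. It is not my exact program: it assumes flat row-major arrays instead of the d[i][j] indexing above, it merges the zeroing loop into the main loop nest of the plain version, and the names compute_plain, compute_blocked, min_int and max_abs_diff, as well as the test data, are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define K_MAX 8

/* Hypothetical helper standing in for the min() used in the question. */
static int min_int(int x, int y) { return x < y ? x : y; }

/* Plain version. Assumed layout: a is n x K_MAX, b is K_MAX x n,
   c has K_MAX entries, d is n x n, all stored row-major. */
void compute_plain(int n, const double *a, const double *b,
                   const double *c, double *d)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            d[i * n + j] = 0.0;
            for (int k = 0; k < K_MAX; k++) {
                d[i * n + j] += 2 * a[i * K_MAX + k] * (b[k * n + j] - c[k]);
            }
        }
    }
}

/* Blocked version: same arithmetic per element, visited tile by tile.
   Tile bounds are clamped to n, so n need not be a multiple of block_size. */
void compute_blocked(int n, int block_size, const double *a,
                     const double *b, const double *c, double *d)
{
    assert(n > 0 && block_size > 0);
    for (int i = 0; i < n; i += block_size) {
        int i_end = min_int(i + block_size, n);
        for (int j = 0; j < n; j += block_size) {
            int j_end = min_int(j + block_size, n);
            for (int ii = i; ii < i_end; ii++) {
                for (int jj = j; jj < j_end; jj++) {
                    d[ii * n + jj] = 0.0;
                    for (int k = 0; k < K_MAX; k++) {
                        d[ii * n + jj] +=
                            2 * a[ii * K_MAX + k] * (b[k * n + jj] - c[k]);
                    }
                }
            }
        }
    }
}

/* Runs both versions on the same made-up input and returns the largest
   absolute difference between their outputs, or -1.0 if allocation fails. */
double max_abs_diff(int n, int block_size)
{
    size_t nn = (size_t)n * (size_t)n;
    double *a = malloc((size_t)n * K_MAX * sizeof *a);
    double *b = malloc((size_t)n * K_MAX * sizeof *b);
    double *d_plain = malloc(nn * sizeof *d_plain);
    double *d_blocked = malloc(nn * sizeof *d_blocked);
    double c[K_MAX];
    double worst = -1.0;

    if (a && b && d_plain && d_blocked) {
        /* Deterministic test data, chosen only for illustration. */
        for (int k = 0; k < K_MAX; k++) {
            c[k] = k * 0.125;
            for (int i = 0; i < n; i++) {
                a[i * K_MAX + k] = ((i + k) % 7) * 0.5;
                b[k * n + i] = ((3 * k + i) % 11) * 0.25;
            }
        }
        compute_plain(n, a, b, c, d_plain);
        compute_blocked(n, block_size, a, b, c, d_blocked);

        worst = 0.0;
        for (size_t idx = 0; idx < nn; idx++) {
            double diff = d_plain[idx] - d_blocked[idx];
            if (diff < 0.0) diff = -diff;
            if (diff > worst) worst = diff;
        }
    }
    free(a);
    free(b);
    free(d_plain);
    free(d_blocked);
    return worst;
}
```

A call such as max_abs_diff(250, 64) should return 0.0 when the blocked loop nest covers every element, including the partial tiles at the edges. Timing code is left out on purpose; this sketch only covers the correctness check mentioned above.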
My processor is an Intel Core i7-10870H.
Thank you in advance