Is cache blocking worse than the original code?

Posted on 2025-01-26 13:22:28

I am researching how to speed up matrix operations (on double values) and have been trying techniques such as cache blocking and loop unrolling. Loop unrolling was really successful, but I cannot improve performance with blocking. I don't know whether I am doing something wrong or whether blocking is simply not useful in this case.

The original code, without any optimization, is:

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            d[i][j] = 0.0;
        }
    }

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            for (int k=0; k<K_MAX; k++){
                d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
            }   
        }
    }

Here K_MAX is always 8 and N takes the values 250, 500, 750, 1000, 1500, 2000, 2550, and 3000.

And what I was trying to do with blocking was:

    for (int i= 0 ; i<N; i+=block_size){
        for (int j=0; j<N; j+=block_size){      
            for (int ii=i; ii<min (i+block_size, N); ii++){
                for (int jj=j; jj<min(j+block_size, N); jj++){      
                    d[ii][jj] = 0.0;
                    for (int k = 0; k<K_MAX; k++){
                        d[ii][jj] += 2 * a[ii][k] * ( b[k][jj]- c[k]);
                    }
                }
            }
        }
    }
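To make the comparison reproducible, both loops can be wrapped in functions and checked against each other before timing. The following is only a minimal sketch under assumptions that are not in the post: the matrices are stored as contiguous row-major arrays, and the names `kernel_naive`, `kernel_blocked`, `max_abs_diff`, and `min_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define K_MAX 8

/* Assumed layout (not stated in the post): contiguous row-major arrays.
   a is n x K_MAX, b is K_MAX x n, c has K_MAX entries, d is n x n. */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Reference kernel: same arithmetic as the original triple loop,
   accumulating in a local variable before storing into d. */
void kernel_naive(size_t n, const double *a, const double *b,
                  const double *c, double *d)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < K_MAX; k++)
                sum += 2 * a[i * K_MAX + k] * (b[k * n + j] - c[k]);
            d[i * n + j] = sum;
        }
    }
}

/* Blocked kernel: tiles the i and j loops; the min_size calls clip the
   last tile when n is not a multiple of block_size. */
void kernel_blocked(size_t n, size_t block_size, const double *a,
                    const double *b, const double *c, double *d)
{
    assert(block_size > 0);
    for (size_t i = 0; i < n; i += block_size) {
        size_t i_end = min_size(i + block_size, n);
        for (size_t j = 0; j < n; j += block_size) {
            size_t j_end = min_size(j + block_size, n);
            for (size_t ii = i; ii < i_end; ii++) {
                for (size_t jj = j; jj < j_end; jj++) {
                    double sum = 0.0;
                    for (size_t k = 0; k < K_MAX; k++)
                        sum += 2 * a[ii * K_MAX + k] * (b[k * n + jj] - c[k]);
                    d[ii * n + jj] = sum;
                }
            }
        }
    }
}

/* Largest absolute element-wise difference between two arrays. */
double max_abs_diff(size_t count, const double *x, const double *y)
{
    double worst = 0.0;
    for (size_t t = 0; t < count; t++) {
        double diff = x[t] - y[t];
        if (diff < 0) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Calling both kernels on the same inputs and checking that `max_abs_diff` is (near) zero confirms the blocked version is correct before any timing is compared.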

I'm probably choosing a bad value for block_size because I don't understand how to choose a good one. I tried every divisor of N as the block size, from 1 to N. I also tried multiples of the number of elements that fit in a cache line (8 doubles), such as 8, 64, 128, 256, and 512. I know N is not always a multiple of those values, so the leftover elements that a full block does not cover have to be handled; I handle them and get correct outputs. Even so, performance did not improve. I also tried using one block size for every value of N, with no improvement either.
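One way to reason about a block size is to count the bytes each variant touches and compare them with the cache sizes of the target CPU (which are not given here and would need to be looked up). The sketch below only does that arithmetic; the function names `b_bytes` and `tile_bytes` are hypothetical, and it assumes 8-byte doubles.

```c
#include <stddef.h>

#define K_MAX 8

/* Bytes of b (K_MAX x n doubles) swept by one full pass over j
   in the unblocked loop. */
size_t b_bytes(size_t n)
{
    return (size_t)K_MAX * n * sizeof(double);
}

/* Bytes touched while computing one block_size x block_size tile of d:
   block_size rows of a, a K_MAX x block_size slice of b, all of c,
   and the tile of d itself. */
size_t tile_bytes(size_t block_size)
{
    size_t doubles = block_size * K_MAX         /* rows of a   */
                   + K_MAX * block_size         /* slice of b  */
                   + K_MAX                      /* c           */
                   + block_size * block_size;   /* tile of d   */
    return doubles * sizeof(double);
}
```

With 8-byte doubles, `b_bytes(3000)` is 192000 bytes and `tile_bytes(64)` is 41024 bytes. Because K_MAX is only 8, all of b is small even for the largest N and each element of d is written once, so this arithmetic suggests (without proving) that there may be little data reuse left for blocking to exploit.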

My processor is an Intel Core i7-10870H.

Thank you in advance
