Is cache blocking worse than the original code?

Posted on 2025-01-26 13:22:28


I am researching how to speed up a matrix operation (on values of type double), and I have been trying techniques such as cache blocking and loop unrolling. Loop unrolling was really successful, but I cannot improve performance with blocking. I don't know whether that is because I am doing something wrong or because blocking is simply not useful in this case.

The original code, without any optimization, is:

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            d[i][j] = 0.0;
        }
    }

    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            for (int k=0; k<K_MAX; k++){
                d[i][j] += 2 * a[i][k] * (b[k][j] - c[k]);
            }   
        }
    }

where K_MAX is always 8 and N takes one of the values 250, 500, 750, 1000, 1500, 2000, 2550, 3000.
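To put numbers on the data this kernel touches, here is a rough sketch of the footprint arithmetic. It assumes a is stored as N × K_MAX and b as K_MAX × N (which is how they are indexed above), and that a double is 8 bytes. The helper names are made up for illustration and are not part of my real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define K_MAX 8

/* Bytes in one row of a: K_MAX doubles (assumes a is N x K_MAX). */
size_t a_row_bytes(void) { return K_MAX * sizeof(double); }

/* Bytes in all of b: K_MAX rows of n doubles (assumes b is K_MAX x n). */
size_t b_bytes(size_t n) {
    assert(n > 0);
    return K_MAX * n * sizeof(double);
}

/* Bytes in the output d: n x n doubles. */
size_t d_bytes(size_t n) {
    assert(n > 0);
    return n * n * sizeof(double);
}

/* Print the footprint of b and d for each N used in the question. */
void print_footprints(void) {
    static const size_t sizes[] = {250, 500, 750, 1000, 1500, 2000, 2550, 3000};
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        size_t n = sizes[s];
        printf("N=%4zu  b: %6zu bytes  d: %8zu bytes\n",
               n, b_bytes(n), d_bytes(n));
    }
}
```

Under these assumptions one row of a is 64 bytes, all of b is 64·N bytes (192,000 bytes at N = 3000), and d is 8·N² bytes (72,000,000 bytes at N = 3000).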

My attempt at blocking was:

    for (int i= 0 ; i<N; i+=block_size){
        for (int j=0; j<N; j+=block_size){      
            for (int ii=i; ii<min (i+block_size, N); ii++){
                for (int jj=j; jj<min(j+block_size, N); jj++){      
                    d[ii][jj] = 0.0;
                    for (int k = 0; k<K_MAX; k++){
                        d[ii][jj] += 2 * a[ii][k] * ( b[k][jj]- c[k]);
                    }
                }
            }
        }
    }

I am probably choosing a bad value for block_size, because I do not understand how to pick a good one. I tried every divisor of N as the block size, from 1 to N. I also tried multiples of the number of elements that fit in a cache line (8 doubles), such as 8, 64, 128, 256, and 512. I know N is not always a multiple of those values, so the elements left over after the last full block have to be handled; I did that, and the outputs are correct. Performance still did not improve. I also tried using one block size for every value of N, with the same result.
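For completeness, this is roughly how the blocked output can be checked against the original when the block size does not divide N. It is a minimal sketch, not my exact code: it assumes flat row-major arrays, uses made-up fill values, and the names min_int, kernel_naive, kernel_blocked and max_abs_diff are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define K_MAX 8

static int min_int(int x, int y) { return x < y ? x : y; }

/* Reference version: zero d, then accumulate over k. */
void kernel_naive(int n, const double *a, const double *b,
                  const double *c, double *d) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            d[i * n + j] = 0.0;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < K_MAX; k++)
                d[i * n + j] += 2 * a[i * K_MAX + k] * (b[k * n + j] - c[k]);
}

/* Blocked version: min_int() clamps the last, partial block in each
   dimension, so the leftover elements are still computed. */
void kernel_blocked(int n, int block_size, const double *a, const double *b,
                    const double *c, double *d) {
    assert(block_size > 0);
    for (int i = 0; i < n; i += block_size)
        for (int j = 0; j < n; j += block_size)
            for (int ii = i; ii < min_int(i + block_size, n); ii++)
                for (int jj = j; jj < min_int(j + block_size, n); jj++) {
                    d[ii * n + jj] = 0.0;
                    for (int k = 0; k < K_MAX; k++)
                        d[ii * n + jj] +=
                            2 * a[ii * K_MAX + k] * (b[k * n + jj] - c[k]);
                }
}

/* Runs both versions on the same (arbitrary, deterministic) inputs and
   returns the largest absolute difference, or -1.0 if allocation fails. */
double max_abs_diff(int n, int block_size) {
    size_t nn = (size_t)n * (size_t)n;
    double *a = malloc((size_t)n * K_MAX * sizeof *a);
    double *b = malloc((size_t)K_MAX * (size_t)n * sizeof *b);
    double *d1 = malloc(nn * sizeof *d1);
    double *d2 = malloc(nn * sizeof *d2);
    double c[K_MAX];
    double worst = -1.0;

    if (a && b && d1 && d2) {
        for (int k = 0; k < K_MAX; k++) {
            c[k] = 0.125 * k;
            for (int i = 0; i < n; i++) {
                a[i * K_MAX + k] = 0.5 * (i % 7) + k;  /* a is n x K_MAX */
                b[k * n + i] = 0.25 * (k + i % 5);     /* b is K_MAX x n */
            }
        }
        kernel_naive(n, a, b, c, d1);
        kernel_blocked(n, block_size, a, b, c, d2);

        worst = 0.0;
        for (size_t t = 0; t < nn; t++) {
            double diff = d1[t] - d2[t];
            if (diff < 0) diff = -diff;
            if (diff > worst) worst = diff;
        }
    }
    free(a); free(b); free(d1); free(d2);
    return worst;
}
```

For example, max_abs_diff(250, 64) exercises a block size that does not divide N (250 = 3·64 + 58), so the last block in each dimension is only 58 elements wide.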

My processor is an Intel Core i7-10870H.

Thank you in advance
