通过缓存局部性提高 C 函数性能？

发布于 2024-12-09 05:54:35 字数 610 浏览 0 评论 0原文

我必须找到表示为二维数组的矩阵中的对角线差异，函数原型是

int diagonal_diff(int x[512][512])

我必须使用二维数组，数据为 512x512。这是在 SPARC 机器上测试的：我当前的计时是 6 毫秒，但我需要低于 2 毫秒。

示例数据：

[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]

区别在于：

|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16

为了做到这一点，我使用以下算法：

int i,j,result=0;
for(i=0; i<4; i++)
    for(j=0; j<4; j++)
        result+=abs(array[i][j]-[j][i]);

return result;

但该算法不断访问列、行、列、行等，这使得缓存的使用效率低下。

有没有办法改善我的功能？

原文

I have to find a diagonal difference in a matrix represented as 2d array and the function prototype is

int diagonal_diff(int x[512][512])

I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.

Sample data:

[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]

The difference is:

|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16

In order to do that, I use the following algorithm:

int i,j,result=0;
for(i=0; i<4; i++)
    for(j=0; j<4; j++)
        result+=abs(array[i][j]-[j][i]);

return result;

But this algorithm keeps accessing the column, row, column, row, etc which make inefficient use of cache.

Is there a way to improve my function?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

落日海湾 2024-12-16 05:54:35

编辑：为什么面向块的方法更快？我们通过确保无论是按行还是按列迭代块，我们都保证整个块适合缓存，从而利用 CPU 的数据缓存。

例如，如果您有 32 字节的缓存行，而 int 为 4 字节，则可以将 8x8 int 矩阵放入 8 个缓存行中。假设您有足够大的数据缓存，您可以按行或按列迭代该矩阵，并保证不会破坏缓存。另一种思考方式是，如果您的矩阵适合缓存，您可以以任何您想要的方式遍历它。

如果您有一个更大的矩阵，例如 512x512，那么您需要调整矩阵遍历，以免破坏缓存。例如，如果您以与矩阵布局相反的顺序遍历矩阵，则几乎总是会错过您访问的每个元素的缓存。

面向块的方法可确保在 CPU 必须刷新该高速缓存行之前，您最终将访问的数据只有高速缓存未命中。换句话说，调整缓存行大小的面向块的方法将确保您不会破坏缓存。

因此，如果您尝试优化正在运行的计算机的缓存行大小，则可以以块形式迭代矩阵，并确保仅访问每个矩阵元素一次：

int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i,j, block_i, block_j,result=0;

     // sum diagonal blocks
    for (block_i= 0; block_i<512; block_i+= block_size)
        for (block_j= block_i + block_size; block_j<512; block_j+= block_size)
            for(i=0; i<block_size; i++)
                for(j=0; j<block_size; j++)
                    result+=abs(array[block_i + i][block_j + j]-array[block_j + j][block_i + i]);

    result+= result;

     // sum diagonal
    for (int block_offset= 0; block_offset<512; block_offset+= block_size)
    {
        for (i= 0; i<block_size; ++i)
        {
            for (j= i+1; j<block_size; ++j)
            {
                int value= abs(array[block_offset + i][block_offset + j]-array[block_offset + j][block_offset + i]);
                result+= value + value;
            }
        }
    }

    return result;
}

您应该尝试 的各种值块大小。在我的机器上，与 block_size 1 相比，8 带来最大的速度提升 (2.5 倍)（与整个矩阵的原始迭代相比，约 5 倍）。 block_size 理想情况下应为 cache_line_size_in_bytes/sizeof(int)。

EDIT: Why is a block oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, we guarantee that the entire block fits into the cache.

For example, if you have a cache line of 32-bytes and an int is 4 bytes, you can fit a 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it is if your matrix fits in the cache, you can traverse it any way you want.

If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.

A block oriented approach ensures that you only have a cache miss for data you will eventually visit before the CPU has to flush that cache line. In other words, a block oriented approach tuned to the cache line size will ensure you don't thrash the cache.

So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:

int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i,j, block_i, block_j,result=0;

     // sum diagonal blocks
    for (block_i= 0; block_i<512; block_i+= block_size)
        for (block_j= block_i + block_size; block_j<512; block_j+= block_size)
            for(i=0; i<block_size; i++)
                for(j=0; j<block_size; j++)
                    result+=abs(array[block_i + i][block_j + j]-array[block_j + j][block_i + i]);

    result+= result;

     // sum diagonal
    for (int block_offset= 0; block_offset<512; block_offset+= block_size)
    {
        for (i= 0; i<block_size; ++i)
        {
            for (j= i+1; j<block_size; ++j)
            {
                int value= abs(array[block_offset + i][block_offset + j]-array[block_offset + j][block_offset + i]);
                result+= value + value;
            }
        }
    }

    return result;
}

You should experiment with various values for block_size. On my machine, 8 lead to the biggest speed up (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).

回复收藏 0 原文

秋凉 2024-12-16 05:54:35

如果您有像 intel MKL 这样好的矢量/矩阵库，也可以尝试矢量化方式。

在matlab中非常简单：
结果 = sum(sum(abs(xx')));

我也在matlab中重现了Hans的方法和MSN的方法，结果是：

Elapsed time is 0.211480 seconds.  (Hans)

Elapsed time is 0.009172 seconds.  (MSN)

Elapsed time is 0.002193 seconds.  (Mine)

If you have a good vector/matrix library like intel MKL, also try the vectorized way.

very simple in matlab:
result = sum(sum(abs(x-x')));

I reproduced Hans's method and MSN's method in matlab too, and the results are:

Elapsed time is 0.211480 seconds.  (Hans)

Elapsed time is 0.009172 seconds.  (MSN)

Elapsed time is 0.002193 seconds.  (Mine)

回复收藏 0 原文

简单气质女生网名 2024-12-16 05:54:35

通过一项微小的更改，您可以让循环仅在所需的索引上运行。我刚刚更改了 j 循环初始化。

int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {
        result += abs(array[i][j] - array[j][i]);
    }
}

With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.

int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {
        result += abs(array[i][j] - array[j][i]);
    }
}

回复收藏 0 原文

~没有更多了~

关于作者

长途伴

暂无简介

0 文章

0 评论

23 人气

关注发私信

友情链接

文江博客

通过缓存局部性提高 C 函数性能？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

已经忘了多久

15867725375

LonelySnow

走过海棠暮

轻许诺言

信馬由缰

友情链接

通过缓存局部性提高 C 函数性能？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

已经忘了多久

15867725375

LonelySnow

走过海棠暮

轻许诺言

信馬由缰

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。