Improving C function performance through cache locality?
I have to find a diagonal difference in a matrix represented as a 2D array, and the function prototype is
int diagonal_diff(int x[512][512])
I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.
Sample data:
[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]
The difference is:
|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16
In order to do that, I use the following algorithm:
int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);
return result;
But this algorithm keeps accessing column, row, column, row, etc., which makes inefficient use of the cache.
Is there a way to improve my function?
EDIT: Why is a block-oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, the entire block fits into the cache.

For example, if you have a 32-byte cache line and an int is 4 bytes, you can fit an 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want. If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal so that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of its memory layout, you will almost always miss the cache on every element you visit.

A block-oriented approach ensures that you only take a cache miss for data you will eventually visit, before the CPU has to flush that cache line. In other words, a block-oriented approach tuned to the cache line size will ensure you don't thrash the cache.

So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:
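The answer's code block is not shown here; a minimal sketch of such a blocked traversal (my reconstruction, not the author's original) could look like the following. It assumes a block size of 8 that divides 512 evenly, and walks only the upper triangle of the matrix tile by tile, so each tile and its transposed partner stay cache-resident:

```c
#include <stdlib.h>

#define N 512
#define BLOCK_SIZE 8  /* ideally cache_line_size_in_bytes / sizeof(int) */

/* Sum |x[i][j] - x[j][i]| over the upper triangle, visiting the matrix
 * in BLOCK_SIZE x BLOCK_SIZE tiles so each pair of tiles fits in cache. */
int diagonal_diff(int x[N][N])
{
    int result = 0;
    for (int ib = 0; ib < N; ib += BLOCK_SIZE)
        for (int jb = ib; jb < N; jb += BLOCK_SIZE)
            for (int i = ib; i < ib + BLOCK_SIZE; i++)
                /* diagonal tiles (jb == ib): start j at i + 1 so each
                 * symmetric pair is compared exactly once */
                for (int j = (jb > ib ? jb : i + 1); j < jb + BLOCK_SIZE; j++)
                    result += abs(x[i][j] - x[j][i]);
    return result;
}
```

Each element is read at most twice (once as x[i][j], once as x[j][i]), and both reads happen while the enclosing tiles are still hot in the cache.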
You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). block_size should ideally be cache_line_size_in_bytes / sizeof(int).
If you have a good vector/matrix library like Intel MKL, also try the vectorized way.

It's very simple in MATLAB:

result = sum(sum(abs(x - x')));

I reproduced Hans's method and MSN's method in MATLAB too, and the results are:
With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.
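The answer's code is not shown here, but presumably the change is to start the inner loop at j = i + 1, so only the upper triangle is visited and each symmetric pair is compared exactly once, matching the worked example in the question. A sketch (my reconstruction, not the author's original):

```c
#include <stdlib.h>

/* Visit only the upper triangle: starting j at i + 1 halves the work
 * and avoids double-counting each symmetric pair (i, j). */
int diagonal_diff(int x[512][512])
{
    int result = 0;
    for (int i = 0; i < 512; i++)
        for (int j = i + 1; j < 512; j++)
            result += abs(x[i][j] - x[j][i]);
    return result;
}
```

Note this alone halves the element visits but still strides down columns for x[j][i], so it can be combined with the block-oriented traversal for the full cache benefit.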