Performance difference between the two traversal directions of a 2D array
This code (A) executes much faster (about 10 times) than the second one:
for(int w=0; w<width; w++) {
    for(int h=1; h<height; h++) {
        image[h][w] = (1-a)*image[h][w] + a*image[h-1][w];
    }
}
Second one:
for(int h=0; h<height; h++) {
    for(int w=1; w<width; w++) {
        image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
    }
}
Why is that? Traversing all of the pixels in the image should be the same whether it is done in the horizontal or the vertical direction.
Is there a way to speed up the second one?
Thanks in advance.
2 Answers
This has to do with locality of reference. If you access the elements in the same order as they are stored in memory, this will be much faster than accessing them in a strided pattern, as the memory caches and memory bandwidth will be utilized far more effectively.
The above would explain the second version being faster than the first one, and this is exactly what happens on my box:
Here is the code I use to allocate the array:
Version 1 times the following loop:
Version 2 loop:
Compiled with g++ 4.4.3 with -O3 and run on a Xeon box of some description (64-bit Ubuntu). If you are still 100% sure that you're seeing the opposite effect, there must be something fundamentally different about what you're doing compared to what I'm doing. It might help if you told us the dimensions of your image and how exactly it gets allocated (in order to help establish the memory layout).
aix is right about locality of reference. To be more explicit, this is because of memory hierarchy.
When you first access an element, it's probably a cache miss. An entire cache line is loaded, then the read/write happens.
Depending on which direction you traverse the array in, the next access will either be at location i+1 or i+N. i+1 is likely to be in the same cache line but i+N will generally be in another cache line, demanding another large fetch.
For small N, the whole thing ends up in cache and it doesn't matter much about direction. For suitably large N, not all of the array can fit into the fastest (and smallest) part of the cache, so the cache line containing element i may get dropped before you access i+M*N and have to be reloaded before accessing i+1.
To make it as fast as possible you have to get particular about the CPU architecture. Some are more alignment-sensitive than others. Some prefer you to touch each cache line once (up to capacity) and then do the copying afterwards. Timeslicing and processor sharing mess things up, of course.