SIMD/SSE newbie: simple image filtering
I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring). The code below filters each pixel of an 8-bit grayscale bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time.
What seems very bad about this code, at least to me, is that there is a lot of insert/extract in it, which is not very elegant and probably slows everything down as well. Is there a better way to wrap data from one register into another when shifting?
buf is the image data, 16-byte aligned. w/h are the width and height, multiples of 16.
__m128i *p = (__m128i *) buf;
__m128i cur1, cur2, sum1, sum2, zeros, tmp1, tmp2, saved;
zeros = _mm_setzero_si128();
int x;
short shifted, last = 0, next;
// preload first row
cur1 = _mm_load_si128(p);
for (x = 1; x < (w * h) / 16; x++) {
// unpack
sum1 = sum2 = saved = cur1;
sum1 = _mm_unpacklo_epi8(sum1, zeros);
sum2 = _mm_unpackhi_epi8(sum2, zeros);
cur1 = tmp1 = sum1;
cur2 = tmp2 = sum2;
// "middle" pixel
sum1 = _mm_add_epi16(sum1, sum1);
sum2 = _mm_add_epi16(sum2, sum2);
// left pixel
cur2 = _mm_slli_si128(cur2, 2);
shifted = _mm_extract_epi16(cur1, 7);
cur2 = _mm_insert_epi16(cur2, shifted, 0);
cur1 = _mm_slli_si128(cur1, 2);
cur1 = _mm_insert_epi16(cur1, last, 0);
sum1 = _mm_add_epi16(sum1, cur1);
sum2 = _mm_add_epi16(sum2, cur2);
// right pixel
tmp1 = _mm_srli_si128(tmp1, 2);
shifted = _mm_extract_epi16(tmp2, 0);
tmp1 = _mm_insert_epi16(tmp1, shifted, 7);
tmp2 = _mm_srli_si128(tmp2, 2);
// preload next row
cur1 = _mm_load_si128(p + x);
// we need the first pixel of the next row for the "right" pixel
next = _mm_extract_epi16(cur1, 0) & 0xff;
tmp2 = _mm_insert_epi16(tmp2, next, 7);
// and the last pixel of last row for the next "left" pixel
last = ((uint16_t) _mm_extract_epi16(saved, 7)) >> 8;
sum1 = _mm_add_epi16(sum1, tmp1);
sum2 = _mm_add_epi16(sum2, tmp2);
// divide
sum1 = _mm_srli_epi16(sum1, 2);
sum2 = _mm_srli_epi16(sum2, 2);
sum1 = _mm_packus_epi16(sum1, sum2);
_mm_store_si128(p + x - 1, sum1);
}
2 Answers
I suggest keeping the neighbouring pixels in SSE registers. That is, keep the results of _mm_slli_si128 / _mm_srli_si128 in SSE variables and eliminate all of the inserts and extracts. My reasoning is that on older CPUs the insert/extract instructions require communication between the SSE units and the general-purpose units, which is much slower than keeping the calculation within SSE, even if it spills over to the L1 cache.
When that is done, there should be only four 16-bit shifts (_mm_slli_si128 / _mm_srli_si128, not counting the division shift). My suggestion is to benchmark the code at that point, because it may already have hit the memory bandwidth limit, which would mean you can't optimize any further.
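As a minimal sketch of that idea, assuming prev, cur and next (names of my own, not from the question) each hold eight 16-bit pixels already unpacked from bytes, the neighbours can be formed entirely inside SSE registers:
__m128i left  = _mm_or_si128(_mm_slli_si128(cur, 2),     // shift pixels up by one lane
                             _mm_srli_si128(prev, 14));  // bring in the last pixel of prev
__m128i right = _mm_or_si128(_mm_srli_si128(cur, 2),     // shift pixels down by one lane
                             _mm_slli_si128(next, 14));  // bring in the first pixel of next
__m128i sum   = _mm_add_epi16(_mm_add_epi16(left, right),
                              _mm_add_epi16(cur, cur));  // the [1 2 1] weighting
sum = _mm_srli_epi16(sum, 2);                            // divide by 4
No insert or extract is needed; the boundary pixels come from the neighbouring registers via the OR.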
If the image is large (bigger than the L2 cache) and the output won't be read back soon, try using MOVNTDQ (_mm_stream_si128) for the write-back. According to several websites it is part of SSE2, although you might want to double-check.
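For example, the final store in the question's loop could become a streaming store (a sketch only; the _mm_sfence after the loop is there so the non-temporal stores are ordered before later reads):
_mm_stream_si128(p + x - 1, sum1);  // non-temporal store (MOVNTDQ), bypasses the cache
// ... and once after the whole loop:
_mm_sfence();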
SIMD tutorial:
Some SIMD guru websites:
This kind of neighbourhood operation was always a pain with SSE, until SSE3.5 (aka SSSE3) came along, and PALIGNR (_mm_alignr_epi8) was introduced.
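For instance, with three consecutive 16-byte blocks of a row already loaded into prev, cur and next (hypothetical names), the shifted neighbours come straight out of PALIGNR with no insert/extract:
#include <tmmintrin.h>  // SSSE3

__m128i left  = _mm_alignr_epi8(cur, prev, 15);  // byte i becomes the left neighbour of cur[i]
__m128i right = _mm_alignr_epi8(next, cur, 1);   // byte i becomes the right neighbour of cur[i]
// then unpack to 16 bits and apply the [1 2 1] weighting as in the question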
If you need backward compatibility with SSE2/SSE3 though, you can write an equivalent macro or inline function which emulates _mm_alignr_epi8 for SSE2/SSE3 and which drops through to _mm_alignr_epi8 when targeting SSE3.5/SSE4.
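One possible SSE2 fallback (the byte count has to be a compile-time constant, which is why this is written as a macro; treat it as a sketch rather than a drop-in replacement):
#include <emmintrin.h>  // SSE2

// behaves like _mm_alignr_epi8(hi, lo, N) for N in 0..16
#define MY_ALIGNR_EPI8(hi, lo, N) \
    _mm_or_si128(_mm_srli_si128((lo), (N)), _mm_slli_si128((hi), 16 - (N)))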
Another approach is to use misaligned loads to get the shifted data. This is relatively expensive on older CPUs (roughly twice the latency and half the throughput of aligned loads), but it may be acceptable depending on how much computation you're doing per load. It also has the benefit that on current Intel CPUs (Core i7) misaligned loads have no penalty compared to aligned loads, so your code will be quite efficient on Core i7 et al.
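A sketch of the unaligned-load variant, assuming row is a uint8_t pointer to the current aligned 16-byte block and that there is at least one valid byte on either side of it (a border-handling assumption):
__m128i cur   = _mm_load_si128((const __m128i *) row);         // aligned load of the block
__m128i left  = _mm_loadu_si128((const __m128i *) (row - 1));  // byte i = left neighbour of cur[i]
__m128i right = _mm_loadu_si128((const __m128i *) (row + 1));  // byte i = right neighbour of cur[i]
// widen to 16 bits and apply the [1 2 1] weighting as before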