Bilinear filter with SSE4.1 intrinsics
I am trying to write a reasonably fast bilinear filtering function that processes just one filtered sample at a time, as an exercise in getting used to intrinsics - anything up to SSE4.1 is fine.
So far I have the following:
inline __m128i DivideBy255_8xUint16(const __m128i value)
{
// Blinn 16bit divide by 255 trick but across 8 packed 16bit values
const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128));
const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8); // TODO: Should this be an arithmetic or logical shift or does it matter?
const __m128i partial = _mm_add_epi16(plus128, plus128ThenDivideBy256);
const __m128i result = _mm_srli_epi16(partial, 8); // TODO: Should this be an arithmetic or logical shift or does it matter?
return result;
}
inline uint32_t BilinearSSE41(const uint8_t* data, uint32_t pitch, uint32_t width, uint32_t height, float u, float v)
{
// TODO: There are probably intrinsics I haven't found yet to avoid using these?
// 0x80 is high bit set which means zero out that component
const __m128i unpack_fraction_u_mask = _mm_set_epi8(0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0);
const __m128i unpack_fraction_v_mask = _mm_set_epi8(0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1);
const __m128i unpack_two_texels_mask = _mm_set_epi8(0x80, 7, 0x80, 6, 0x80, 5, 0x80, 4, 0x80, 3, 0x80, 2, 0x80, 1, 0x80, 0);
// TODO: Potentially wasting two channels of operations for now
const __m128i size = _mm_set_epi32(0, 0, height - 1, width - 1);
const __m128 uv = _mm_set_ps(0.0f, 0.0f, v, u);
const __m128 floor_uv_f = _mm_floor_ps(uv);
const __m128 fraction_uv_f = _mm_sub_ps(uv, floor_uv_f);
const __m128 fraction255_uv_f = _mm_mul_ps(fraction_uv_f, _mm_set_ps1(255.0f));
const __m128i fraction255_uv_i = _mm_cvttps_epi32(fraction255_uv_f); // TODO: Did this get rounded correctly?
const __m128i fraction255_u_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_u_mask); // Splat fraction_u*255 across all 16 bit words
const __m128i fraction255_v_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_v_mask); // Splat fraction_v*255 across all 16 bit words
const __m128i inverse_fraction255_u_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_u_i);
const __m128i inverse_fraction255_v_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_v_i);
const __m128i floor_uv_i = _mm_cvttps_epi32(floor_uv_f);
const __m128i clipped_floor_uv_i = _mm_min_epu32(floor_uv_i, size); // TODO: I haven't clamped this properly if uv was less than zero yet...
// TODO: Calculating the addresses in the SSE register set would maybe be better
int u0 = _mm_extract_epi32(clipped_floor_uv_i, 0);
int v0 = _mm_extract_epi32(clipped_floor_uv_i, 1);
const uint8_t* row = data + (u0<<2) + pitch*v0; // 4 bytes per ARGB8 texel
const __m128i row0_packed = _mm_loadl_epi64((const __m128i*)row); // Texels (u0, v0) and (u0+1, v0)
const __m128i row0 = _mm_shuffle_epi8(row0_packed, unpack_two_texels_mask);
const __m128i row1_packed = _mm_loadl_epi64((const __m128i*)(row + pitch)); // Texels (u0, v0+1) and (u0+1, v0+1)
const __m128i row1 = _mm_shuffle_epi8(row1_packed, unpack_two_texels_mask);
// Compute row0*(255 - fraction)/255 + row1*fraction/255 - probably slight precision loss across addition!
// Note the weighting: row0 sits at floor(v) so it takes the inverse fraction, row1 takes the fraction
const __m128i vlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(row0, inverse_fraction255_v_i));
const __m128i vlerp1 = DivideBy255_8xUint16(_mm_mullo_epi16(row1, fraction255_v_i));
const __m128i vlerp = _mm_adds_epi16(vlerp0, vlerp1);
const __m128i hlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(vlerp, inverse_fraction255_u_i));
const __m128i hlerp1 = DivideBy255_8xUint16(_mm_srli_si128(_mm_mullo_epi16(vlerp, fraction255_u_i), 16 - 2*4)); // Shift the (u0+1) texel down into the low words
const __m128i hlerp = _mm_adds_epi16(hlerp0, hlerp1);
// Pack down to 8bit from 16bit components and return 32bit ARGB result
return _mm_extract_epi32(_mm_packus_epi16(hlerp, hlerp), 0);
}
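As a sanity check, here is a minimal scalar sketch of the same Blinn divide-by-255 trick that DivideBy255_8xUint16 vectorises, plus an exhaustive test loop (the helper names here are mine, just for illustration):

#include <cassert>
#include <cstdint>
#include <smmintrin.h>

// Scalar form of the same trick: exact round(x / 255) for any
// x in [0, 255*255], which covers every 8bit*8bit product above.
inline uint16_t DivideBy255_Scalar(uint32_t x)
{
    const uint32_t plus128 = x + 128;
    return static_cast<uint16_t>((plus128 + (plus128 >> 8)) >> 8);
}

// Exhaustively compare lane 0 of the SIMD helper against the scalar reference.
inline void TestDivideBy255()
{
    for (uint32_t x = 0; x <= 255u * 255u; ++x)
    {
        const __m128i wide = DivideBy255_8xUint16(_mm_set1_epi16(static_cast<short>(x)));
        assert(static_cast<uint16_t>(_mm_extract_epi16(wide, 0)) == DivideBy255_Scalar(x));
    }
}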
The code assumes the image data is ARGB8 and has an extra column and row to handle edge cases without having to branch.
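For anyone reproducing this, a minimal sketch (names and layout are my own assumptions) of how such a padded buffer could be built, duplicating the last column and row so the unconditional 2x2 fetch never reads past the image:

#include <cstdint>
#include <cstring>
#include <vector>

// Copies a width x height ARGB8 image (4 bytes per texel) into a buffer
// with one extra column and row, duplicating the last column and row so
// the filter's unconditional 2x2 fetch stays inside the allocation.
// Assumes width >= 1 and height >= 1.
std::vector<uint8_t> MakePaddedARGB8(const uint8_t* src, uint32_t srcPitch,
                                     uint32_t width, uint32_t height)
{
    const uint32_t paddedPitch = (width + 1) * 4;
    std::vector<uint8_t> padded(paddedPitch * (height + 1));
    for (uint32_t y = 0; y < height; ++y)
    {
        uint8_t* dstRow = padded.data() + y * paddedPitch;
        std::memcpy(dstRow, src + y * srcPitch, width * 4);
        std::memcpy(dstRow + width * 4, dstRow + (width - 1) * 4, 4); // duplicate last texel
    }
    std::memcpy(padded.data() + height * paddedPitch,
                padded.data() + (height - 1) * paddedPitch, paddedPitch); // duplicate last row
    return padded;
}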
I am after advice on what instructions I can use to bring down the size of this gangly mess and of course how it can be improved to run faster!
Thanks :)
2 Answers
Nothing specific to say about your code, but I wrote my own bilinear scaling code using SSE2 - see the Stack Overflow question Help me improve some more SSE2 code for more details.
In my code I calculate the horizontal and vertical fractions and indexes up front rather than per pixel, which I think is faster.
That said, under Core 2 CPUs my code seems to be memory-bound rather than CPU-bound, so not doing the precalculation might be faster.
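A minimal sketch of that precalculation idea for a fixed-ratio scale - the column indexes and 0-255 fractions are computed once and reused for every row (illustrative names only, not the actual code from that question):

#include <cstdint>
#include <vector>

struct ColumnStep
{
    uint32_t index;    // x0: left texel column for this destination x
    uint16_t fraction; // (u - x0) * 255, ready for the 16bit blend weights
};

// Computes the horizontal indexes and fractions once per scaling pass
// instead of redoing the float work for every output pixel. The same
// idea applies per destination row for the vertical fraction.
std::vector<ColumnStep> PrecalcColumns(uint32_t srcWidth, uint32_t dstWidth)
{
    std::vector<ColumnStep> steps(dstWidth);
    const float scale = static_cast<float>(srcWidth - 1) / static_cast<float>(dstWidth);
    for (uint32_t x = 0; x < dstWidth; ++x)
    {
        const float u = x * scale;
        const uint32_t x0 = static_cast<uint32_t>(u);
        steps[x].index = x0;
        steps[x].fraction = static_cast<uint16_t>((u - static_cast<float>(x0)) * 255.0f);
    }
    return steps;
}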
Noticed your comment "TODO: Should this be an arithmetic or logical shift or does it matter?"
Arithmetic shift is for signed integers. Logical shift is for unsigned integers.
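It does matter in this helper: after _mm_mullo_epi16 the words hold unsigned products up to 255*255 = 0xFE01, whose top bit is set, so an arithmetic shift would sign-extend and corrupt the result. A small demonstration (names are illustrative):

#include <cassert>
#include <cstdint>
#include <emmintrin.h>

// 0xFE01 = 255*255, the largest product DivideBy255_8xUint16 ever sees.
// Its top bit is set, so the two shift flavours disagree on it.
inline void ShiftDemo()
{
    const __m128i value = _mm_set1_epi16(static_cast<short>(0xFE01));
    const uint16_t logical    = static_cast<uint16_t>(_mm_extract_epi16(_mm_srli_epi16(value, 8), 0));
    const uint16_t arithmetic = static_cast<uint16_t>(_mm_extract_epi16(_mm_srai_epi16(value, 8), 0));
    assert(logical == 0x00FE);    // correct unsigned result - what the divide trick needs
    assert(arithmetic == 0xFFFE); // sign bit smeared back in - wrong for unsigned data
}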