Optimizing a loop with TBB and a few instructions (SSE2, SSE4)
I have a simple image-processing algorithm. Briefly, a float image (mean) is subtracted from an 8-bit image, and the result is saved to a float image (dest). The function is written mainly with intrinsics.

I have tried to optimize this function with TBB's parallel_for, but I got no speedup, only a penalty. What should I do? Should I use a lower-level scheme, such as TBB tasks, to optimize the code?
#include <smmintrin.h> // SSE4.1, for _mm_cvtepu8_epi32 (also pulls in SSE2)

float *m, **m_data,
      *o, **o_data;
unsigned char *p, **src_data;
unsigned long len, i;
unsigned long nr, nc;

src_data = src->UByteData;  // 2d array
m_data   = mean->FloatData; // 2d array
o_data   = dest->FloatData; // 2d array
nr = src->Rows;
nc = src->Cols;

__m128i xmm0;
for (i = 0; i < nr; i++)
{
    m = m_data[i];
    o = o_data[i];
    p = src_data[i];
    len = nc; // assumes nc is a multiple of 16 and each row is 16-byte aligned
    do
    {
        _mm_prefetch((const char *)(p + 16), _MM_HINT_NTA);
        _mm_prefetch((const char *)(m + 16), _MM_HINT_NTA);
        xmm0 = _mm_load_si128((__m128i *)(p)); // 16 source pixels
        // widen 4 pixels at a time to 32-bit ints, convert to float,
        // subtract the mean, and stream the result past the cache
        _mm_stream_ps(
            o,
            _mm_sub_ps(
                _mm_cvtepi32_ps(_mm_cvtepu8_epi32(xmm0)),
                _mm_load_ps(m)
            )
        );
        _mm_stream_ps(
            o + 4,
            _mm_sub_ps(
                _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 4))),
                _mm_load_ps(m + 4)
            )
        );
        _mm_stream_ps(
            o + 8,
            _mm_sub_ps(
                _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 8))),
                _mm_load_ps(m + 8)
            )
        );
        _mm_stream_ps(
            o + 12,
            _mm_sub_ps(
                _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 12))),
                _mm_load_ps(m + 12)
            )
        );
        p += 16;
        m += 16;
        o += 16;
        len -= 16;
    }
    while (len);
}
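For reference, the outer row loop is the natural place for TBB parallelism, since rows are independent. Below is a minimal sketch with tbb::parallel_for; process_row is a hypothetical wrapper around the SSE inner loop above, and the grain size of 64 rows is only a guess intended to keep scheduling overhead small relative to the tiny per-row workload. If the serial loop already saturates memory bandwidth, though, no amount of threading will help.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// hypothetical wrapper around the SSE inner loop above, for one row
void process_row(const unsigned char *p, const float *m, float *o,
                 unsigned long nc);

void subtract_mean_parallel(unsigned char **src_data, float **m_data,
                            float **o_data, unsigned long nr, unsigned long nc)
{
    // coarse grain size: each task handles many rows, so the per-task
    // scheduling cost is amortised over a useful amount of work
    tbb::parallel_for(
        tbb::blocked_range<unsigned long>(0, nr, 64 /* grain size, a guess */),
        [&](const tbb::blocked_range<unsigned long> &r) {
            for (unsigned long i = r.begin(); i != r.end(); ++i)
                process_row(src_data[i], m_data[i], o_data[i], nc);
        });
}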
You are doing almost no computation here relative to the number of loads and stores, so you are most likely limited by memory bandwidth rather than by computation: every 16 pixels cost a 16-byte source load, a 64-byte mean load and a 64-byte dest store (144 bytes of traffic) for only 16 subtractions. That would explain why you see no improvement in throughput when you optimise the computation.

I would get rid of the _mm_prefetch instructions, though - they are almost certainly not helping here and may even be hurting performance. If possible, you should combine this loop with any other operations that you perform before or after it - that way you amortise the cost of the memory I/O over more computation.
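To make that last point concrete: suppose, hypothetically, that the next stage of your pipeline squares the difference (e.g. to accumulate a variance). Folding that into the same pass adds one multiply per four pixels but zero extra memory traffic, whereas running it as a separate loop would re-read and re-write the whole dest image. A sketch of the fused step, reusing the names from the question:

// hypothetical fusion: store (src - mean)^2 instead of writing the
// difference now and re-reading it in a later pass
__m128 diff = _mm_sub_ps(
    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(xmm0)),
    _mm_load_ps(m));
_mm_stream_ps(o, _mm_mul_ps(diff, diff)); // one extra multiply, no extra I/O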