How do I take the absolute value of 2 doubles or 4 floats using the SSE instruction set? (Up to SSE4)
Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long and hold doubles; I may drop down to float if I don't need the precision of doubles.
double sum = 0.0;
for(k = 0; k < 3072; k++) {
sum += fabs(sima[k] - simb[k]);
}
double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));
Anyway, my current problem is how to do the fabs step in an SSE register, for doubles or floats, so that the whole calculation stays in SSE registers and remains fast, and so that I can parallelize all of the steps by partly unrolling this loop.
Here are some resources I've found: "fabs() asm", or possibly "flipping the sign" (on SO). The weakness of the second one is that it would need a conditional check.
Comments (3)
I suggest using a bitwise AND with a mask. Positive and negative values have the same representation except for the most significant bit, which is 0 for positive values and 1 for negative values; see the double-precision number format. You can use one of these:
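As a minimal sketch of the mask idea, assuming SSE2 intrinsics from `<emmintrin.h>`: the helper names `abs_pd` and `abs_ps` are hypothetical, and this variant uses AND-NOT with a sign-bit-only constant (`-0.0`) rather than a plain AND with `0x7FFF...`, which should give the same result.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float types) */

/* Absolute value of 2 packed doubles: clear the sign bit (bit 63) of each lane.
   abs_pd is a hypothetical helper name. */
static inline __m128d abs_pd(__m128d x)
{
    const __m128d sign_mask = _mm_set1_pd(-0.0); /* only the sign bit is set */
    return _mm_andnot_pd(sign_mask, x);          /* (~sign_mask) & x */
}

/* Absolute value of 4 packed floats: same idea with bit 31 of each lane. */
static inline __m128 abs_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(sign_mask, x);
}
```

No comparison or branch is involved, so the conditional check mentioned in the question is not needed.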
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, there is no cancellation, so reordering the additions is safe here (the result can differ only by small rounding differences):
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since addition is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one has finished. Also, bitwise operations are executed on a separate execution unit, so the CPU can actually perform them in the same cycle as an addition or subtraction. I suggest reading Agner Fog's optimization manuals.
Finally, I don't recommend using OpenMP. The loop is too small, and the overhead of distributing the job among multiple threads might be bigger than any potential benefit.
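A sketch of what such an unrolled loop might look like, assuming SSE2, an element count that is a multiple of 4, and unaligned loads so that no particular alignment is required. The function name `sum_abs_diff` is made up for illustration; `sum1` and `sum2` are the two independent accumulators.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

/* Sum of |a[k] - b[k]| with two independent accumulators.
   Hypothetical helper; assumes n is a multiple of 4. */
static double sum_abs_diff(const double *a, const double *b, size_t n)
{
    assert(n % 4 == 0);
    const __m128d sign_mask = _mm_set1_pd(-0.0);
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    for (size_t k = 0; k < n; k += 4) {
        __m128d d1 = _mm_sub_pd(_mm_loadu_pd(a + k),     _mm_loadu_pd(b + k));
        __m128d d2 = _mm_sub_pd(_mm_loadu_pd(a + k + 2), _mm_loadu_pd(b + k + 2));
        /* the two additions below do not depend on each other */
        sum1 = _mm_add_pd(sum1, _mm_andnot_pd(sign_mask, d1));
        sum2 = _mm_add_pd(sum2, _mm_andnot_pd(sign_mask, d2));
    }
    /* combine the accumulators, then the two lanes */
    __m128d s = _mm_add_pd(sum1, sum2);
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}
```

With the arrays from the question this would be used as `double fp = 1.0 - sum_abs_diff(sima, simb, 3072) / (255.0 * 1024.0 * 3.0);`.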
The maximum of -x and x should be abs(x). Here it is in code:
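A minimal sketch of this idea with SSE2 intrinsics; the name `abs_max_pd` is hypothetical, and `-x` is computed here as `0 - x`.

```c
#include <emmintrin.h>

/* abs(x) computed as max(-x, x) for 2 packed doubles.
   abs_max_pd is a hypothetical helper name. */
static inline __m128d abs_max_pd(__m128d x)
{
    __m128d neg = _mm_sub_pd(_mm_setzero_pd(), x); /* 0 - x */
    return _mm_max_pd(neg, x);
}
```

Compared with masking off the sign bit, this costs a subtraction plus a max instead of a single bitwise operation, and the sign of a zero result is not guaranteed to be cleared. For a sum of absolute differences that should make no difference to the result.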
Probably the easiest way is as follows:
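A sketch of such a straightforward loop, assuming SSE2, an even element count and unaligned loads. The function name `abs_diff_partial_sums` is hypothetical, and the absolute value is taken by clearing the sign bit, which is only one of the options discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

/* Accumulates |a[k] - b[k]| two doubles at a time and returns the two
   partial sums in one register. Hypothetical helper; assumes n is even. */
static __m128d abs_diff_partial_sums(const double *a, const double *b, size_t n)
{
    assert(n % 2 == 0);
    const __m128d sign_mask = _mm_set1_pd(-0.0);
    __m128d vsum = _mm_setzero_pd();
    for (size_t k = 0; k < n; k += 2) {
        __m128d va = _mm_loadu_pd(a + k);
        __m128d vb = _mm_loadu_pd(b + k);
        __m128d vdiff = _mm_sub_pd(va, vb);
        vsum = _mm_add_pd(vsum, _mm_andnot_pd(sign_mask, vdiff));
    }
    return vsum; /* lane 0: even-indexed elements, lane 1: odd-indexed */
}
```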
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However, if you can drop down to single precision, you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
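One possible way to do that final reduction with SSE2, shown as a sketch; the helper name `hsum_pd` is hypothetical.

```c
#include <emmintrin.h>

/* Adds the two lanes of v and returns the result as a scalar double.
   hsum_pd is a hypothetical helper name. */
static inline double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);      /* both lanes = upper element of v */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi)); /* lower + upper */
}
```

After the loop, `double sum = hsum_pd(vsum);` gives the value needed for the `fp` calculation. A single-precision version would need two shuffle-and-add steps, since there are four partial sums to combine.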