How do I get maximum speed out of SSE?
What are the best settings for things like MXCSR? Which rounding mode is fastest? On what processors? Is it faster to enable signalling NaNs so I get informed when a computation results in a NaN, or does this cause slowdowns in non-NaN computations?
In summary, how do you get maximum speed out of tight inner SSE loops?
Any related x87 floating-point speed advice also welcome.
Comments (2)
If your computation is likely to encounter denormals, and accuracy of very small values is not important to your computation, then by all means turn on FZ and DAZ (once, at the start of your computation; don't touch the MXCSR more than necessary). They won't make any difference if your computation doesn't involve denormal values, but if it does, the difference can be quite significant.
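A minimal sketch of turning both bits on, assuming a C compiler that ships `<xmmintrin.h>` and a CPU that implements DAZ (very early SSE processors did not). The helper names `enable_ftz_daz` and `half_of_min_normal` are made up for illustration; the second one only exists to show the effect on a result that would otherwise be denormal.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */

/* MXCSR bit 6 is denormals-are-zero (DAZ). <pmmintrin.h> offers
   _MM_SET_DENORMALS_ZERO_MODE for the same bit; the raw mask is used
   here only to keep the sketch to a single header. */
#define MXCSR_DAZ 0x0040u

/* Hypothetical helper: call once, before the hot loop. */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  /* FZ: denormal results become 0 */
    _mm_setcsr(_mm_getcsr() | MXCSR_DAZ);        /* DAZ: denormal inputs read as 0 */
}

/* Hypothetical probe: halves the smallest normal float using SSE.
   The exact result is denormal, so it is nonzero with the default
   MXCSR and zero once FZ is on. */
float half_of_min_normal(void)
{
    volatile float tiny = FLT_MIN;  /* volatile keeps the compiler from folding */
    __m128 r = _mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(0.5f));
    return _mm_cvtss_f32(r);
}
```

MXCSR is per-thread state, so in a multithreaded program each worker thread would need its own call to something like `enable_ftz_daz()`.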
None of the other MXCSR bits have any effect on performance at all.
The only x87-related performance advice is: don't use the x87 unit. Do your computations in SSE instead whenever possible.
Use Flush-to-zero and Denormals-are-zero modes: they are intended for speed at a precision cost that you probably won't notice.
I doubt that different rounding modes have different costs. Round-to-nearest is hardest in theory, but in a hardware implementation, I would guess the additional transistors to do it in the same number of cycles are probably there anyway, and are just unused for directed rounding.
Signaling NaNs do not slow down non-NaN computations.
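If the goal is just to find out whether a NaN was produced, one option that avoids trapping is to read the sticky invalid-operation flag in MXCSR once after the loop. This is a sketch under assumptions, not something the answer above prescribes: `sum_sqrt_checked` is a hypothetical function, and the flag only records operations that create a NaN from non-NaN inputs (or consume a signaling NaN), not quiet NaNs that merely propagate.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _MM_SET_EXCEPTION_STATE, _MM_GET_EXCEPTION_STATE */

/* Hypothetical helper: sums sqrt(x[i]) and reports through *saw_invalid
   whether any operation raised the invalid-operation flag, for example
   the square root of a negative number. */
float sum_sqrt_checked(const float *x, size_t n, int *saw_invalid)
{
    _MM_SET_EXCEPTION_STATE(0);  /* clear the sticky flags once, up front */

    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; ++i) {
        acc = _mm_add_ss(acc, _mm_sqrt_ss(_mm_set_ss(x[i])));
    }

    /* The volatile store keeps the compiler from moving the arithmetic
       past the flag read below. */
    volatile float total = _mm_cvtss_f32(acc);

    *saw_invalid = (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_INVALID) != 0;
    return total;
}
```

The flag is checked once per call rather than once per element, which keeps MXCSR accesses out of the inner loop.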
Set the control flags word only once before your computation: changing it during the computation will dwarf any savings you achieve.
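Here is roughly what "only once" could look like around a loop, assuming a kernel that should leave the caller's MXCSR settings as it found them. `scale_in_place` and the `MXCSR_FZ_DAZ` mask name are invented for this sketch; the point is that there is one read and one write before the loop and one write after it, and nothing inside.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, SSE arithmetic */

#define MXCSR_FZ_DAZ 0x8040u  /* FZ is bit 15, DAZ is bit 6 */

/* Hypothetical kernel: multiplies n floats in place by factor. */
void scale_in_place(float *data, size_t n, float factor)
{
    const unsigned int saved = _mm_getcsr();  /* one read ...                */
    _mm_setcsr(saved | MXCSR_FZ_DAZ);         /* ... one write, outside loop */

    const __m128 f = _mm_set1_ps(factor);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {              /* four floats per iteration */
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));
    }
    for (; i < n; ++i) {                      /* scalar tail */
        data[i] *= factor;
    }

    /* Restore the caller's settings once. Note that this also discards
       any exception flags raised inside the loop. */
    _mm_setcsr(saved);
}
```

If the whole program is happy with FZ and DAZ, setting them once at thread start and skipping the save and restore is simpler still.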