Implementation of the BLAS sdot operation in the MKL library
I tested the BLAS sdot interface for single-precision floating-point dot products. I found that the results of the Intel MKL library are slightly different from those of the reference BLAS Fortran code given at http://netlib.org/blas/. The MKL results appear more accurate.
I just wonder: does MKL apply some optimization here? Or how does MKL implement sdot so that it is more accurate?
Well, since MKL is written by a specific CPU vendor especially for their own products, I guess they can use a bit more knowledge about the underlying machine than the reference implementation can.
A first thought is that they use optimized assembly and keep the running sum on the x87 80-bit floating-point stack, without rounding it down to 32 bits in each iteration. Or maybe they use SSE(2) and compute the whole sum in double precision (which shouldn't make much of a difference for addition and multiplication, performance-wise). Or maybe they use a completely different computation scheme, or some other black-magic machine trick.
The point is that these routines are far more optimized for specific hardware than the basic reference implementation, but without seeing their implementation we cannot say in which way. The above-mentioned ideas are just simple approaches.