Is fixed-point arithmetic worth my trouble?
I'm working on a fluid dynamics Navier-Stokes solver that should run in real time. Hence, performance is important.
Right now, I'm looking at a number of tight loops that each account for a significant fraction of the execution time: there is no single bottleneck. Most of these loops do some floating-point arithmetic, but there's a lot of branching in between.
The floating-point operations are mostly limited to additions, subtractions, multiplications, divisions and comparisons. All this is done using 32-bit floats. My target platform is x86 with at least SSE1 instructions. (I've verified in the assembler output that the compiler indeed generates SSE instructions.)
Most of the floating-point values that I'm working with have a reasonably small upper bound, and precision for near-zero values isn't very important. So the thought occurred to me: maybe switching to fixed-point arithmetic could speed things up? I know the only way to be really sure is to measure it, but that might take days, so I'd like to know the odds of success beforehand.
Fixed-point was all the rage back in the days of Doom, but I'm not sure where it stands anno 2010. Considering how much silicon is nowadays pumped into floating-point performance, is there a chance that fixed-point arithmetic will still give me a significant speed boost? Does anyone have any real-world experience that may apply to my situation?
Comments (3)
Stick with floating point. Fixed point is really only useful if you can work within 8 bits or 16 bits and use SIMD (image processing and audio are typical use cases for this).
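To make the trade-off concrete, here is a minimal sketch of what 32-bit fixed-point arithmetic looks like in C. The Q16.16 format and every name in it (`q16_16`, `q_mul`, `q_div`, and so on) are assumptions picked for illustration; nothing in the question fixes a particular format. Addition is a plain integer add, but every multiply or divide needs a 64-bit intermediate plus a rescale, which is part of why narrow 8-bit or 16-bit formats are a more natural fit for integer SIMD.

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16: 16 integer bits, 16 fractional bits (format assumed for illustration). */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE ((q16_16)1 << Q_FRAC_BITS)

/* Conversions; q_from_float truncates toward zero. */
static q16_16 q_from_float(float x) { return (q16_16)(x * (float)Q_ONE); }
static float  q_to_float(q16_16 x)  { return (float)x / (float)Q_ONE; }

/* Addition is an ordinary integer add (overflow is the caller's problem). */
static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

/* Multiplication widens to 64 bits, then shifts the extra fractional bits away.
   Right-shifting a negative value is implementation-defined in C; mainstream
   compilers perform an arithmetic shift here. */
static q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> Q_FRAC_BITS);
}

/* Division pre-scales the dividend in 64 bits so the quotient keeps its
   fractional bits. */
static q16_16 q_div(q16_16 a, q16_16 b) {
    assert(b != 0);
    return (q16_16)(((int64_t)a * Q_ONE) / b);
}
```

For example, `q_mul(q_from_float(1.5f), q_from_float(2.0f))` yields the Q16.16 representation of 3.0. Whether this beats the equivalent float code is something only a measurement on the target machine can settle.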
Modern CPUs typically have 2 FPUs and you can issue up to 2 FP instructions per clock cycle. You also then have the possibility of optimisation using 4 way FP SIMD (SSE).
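As a rough illustration of 4-way FP SIMD, the sketch below computes `y[i] = a * x[i] + y[i]` four floats at a time using SSE1 intrinsics from `<xmmintrin.h>`. The function name `saxpy_sse` and the requirement that `n` be a multiple of 4 are assumptions made to keep the example short; a real loop would need a scalar tail and might prefer aligned loads.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* y[i] = a * x[i] + y[i], processing four floats per iteration.
   Hypothetical helper: assumes n is a multiple of 4. */
static void saxpy_sse(float a, const float *x, float *y, size_t n) {
    assert(n % 4 == 0);
    const __m128 va = _mm_set1_ps(a);           /* broadcast a to all 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);        /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_storeu_ps(y + i, vy);               /* unaligned store */
    }
}
```

On a 32-bit x86 build you may need to enable SSE explicitly (for example `-msse` with GCC); on x86-64 it is available by default.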
If you're still struggling to get good performance then try using a better compiler, such as Intel's ICC. Also, 64-bit Intel executables tend to be somewhat faster than their 32-bit counterparts due to the increased number of registers in the 64-bit model, so build for 64-bit if you can.
And of course you should profile your code too, so that you know for certain where the hotspots are. You don't say what OS you're using but VTune on Windows, Zoom on Linux or Shark on Mac OS X will all help you to quickly and easily find your performance bottlenecks.
As other people have said, if you're already using floating-point SIMD, I doubt you'll get much improvement with fixed point.
You said that the compiler is emitting SSE instructions, but it doesn't sound like you've tried writing your own vectorized SSE code. I don't know how good compilers usually are at vectorizing automatically, but it's something to investigate.
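Since the question mentions a lot of branching inside the loops, one thing hand-written SSE can do is replace a per-element branch with a compare mask. The sketch below is only an illustration of that idea: the function `select_lt_sse`, its arguments, and the multiple-of-4 length restriction are all assumptions, not anything taken from the actual solver.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* out[i] = (x[i] < threshold) ? a[i] : b[i], without a branch per element.
   Hypothetical helper: assumes n is a multiple of 4. */
static void select_lt_sse(const float *x, float threshold,
                          const float *a, const float *b,
                          float *out, size_t n) {
    assert(n % 4 == 0);
    const __m128 vt = _mm_set1_ps(threshold);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx   = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmplt_ps(vx, vt);   /* all-ones bits where x < threshold */
        __m128 va   = _mm_loadu_ps(a + i);
        __m128 vb   = _mm_loadu_ps(b + i);
        /* (mask & a) | (~mask & b): take a where the mask is set, b elsewhere */
        __m128 r = _mm_or_ps(_mm_and_ps(mask, va), _mm_andnot_ps(mask, vb));
        _mm_storeu_ps(out + i, r);
    }
}
```

Both sides of the condition are computed for every element, so this only pays off when each side is cheap compared with a mispredicted branch.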
Two other areas to look at are:
Memory access - if all your computations are done in SSE, then cache misses might be taking up more time than the actual math.
Unrolling - you should be able to get a performance benefit from unrolling your inner loops. The goal is not (as many people think) to reduce the number of loop termination checks. The main benefit is to allow independent instructions to be interleaved, to hide the instruction latency. There is a presentation entitled "VMX Optimization: Taking it up a Level" which might help a bit; it's focused on Altivec instructions on Xbox360, but some of the unrolling advice might help on SSE as well.
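To illustrate the unrolling point, here is a small sketch of a dot product unrolled four times with independent accumulators. The function `dot_unrolled` is a made-up example, not code from the solver. In a naive loop every add has to wait for the previous one; with four separate sums the adds do not depend on each other, so the CPU can overlap them.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product, unrolled 4x with four independent accumulators so that
   consecutive multiply-adds do not form a single dependency chain. */
static float dot_unrolled(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {          /* leftover elements when n is not a multiple of 4 */
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the floating-point additions, so the result can differ from the naive loop in the last few bits.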
As other people have mentioned, profile, profile, profile. And then let us know what's still slow :)
PS - on one of your other posts here, I convinced you to use SOR instead of Gauss-Seidel in your matrix solver. Now that I think about it, is there a reason that you're not using a tri-diagonal solver?
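For reference, a tri-diagonal solve can be done directly with the Thomas algorithm in O(n), with no iteration. The sketch below assumes the system really is tri-diagonal and diagonally dominant (there is no pivoting); the function name `solve_tridiagonal` and the array layout are choices made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Thomas algorithm for a tri-diagonal system.
   Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
   All arrays hold n entries; lower[0] and upper[n-1] are not part of the system.
   scratch is n floats of workspace; rhs is overwritten with the solution.
   No pivoting: assumes a diagonally dominant matrix. */
static void solve_tridiagonal(const float *lower, const float *diag,
                              const float *upper, float *rhs,
                              float *scratch, size_t n) {
    assert(n > 0);
    /* Forward sweep: eliminate the sub-diagonal. */
    scratch[0] = upper[0] / diag[0];
    rhs[0] = rhs[0] / diag[0];
    for (size_t i = 1; i < n; ++i) {
        float denom = diag[i] - lower[i] * scratch[i - 1];
        scratch[i] = upper[i] / denom;
        rhs[i] = (rhs[i] - lower[i] * rhs[i - 1]) / denom;
    }
    /* Back substitution, from the second-to-last row up to the first. */
    for (size_t i = n - 1; i-- > 0; ) {
        rhs[i] -= scratch[i] * rhs[i + 1];
    }
}
```

The forward sweep is inherently sequential along one line of the grid, so any SIMD speed-up would have to come from solving several independent lines at once.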
Your machine is pretty well optimized for floating point, so you probably won't save much by going to fixed-point fractions.
You say there's no single bottleneck, but there may be multiple, and if you manage to shave any one of them, then the others will take larger percentages of the remaining time, drawing your attention to them, so you can shave them too.
You've probably done this, but I would make sure not only that the time-consuming functions are as quick as possible, but also that they are called no more often than necessary.