Prefetch instructions
It appears the general logic for prefetch usage is that a prefetch can be added provided the code stays busy with processing until the prefetch instruction completes its operation. However, it seems that if too many prefetch instructions are used, they can hurt the performance of the system. I find that we first need working code without any prefetch instructions. Later we need to try various combinations of prefetch instructions in various locations of the code and analyse the results to determine which locations actually benefit from the prefetch. Is there a better way to determine the exact locations in which prefetch instructions should be used?
Comments (3)
In the majority of cases prefetch instructions are of little or no benefit, and can even be counter-productive in some cases. Most modern CPUs have an automatic prefetch mechanism which works well enough that adding software prefetch hints achieves little, or even interferes with automatic prefetch, and can actually reduce performance.
In some rare cases, such as when you are streaming large blocks of data on which you are doing very little actual processing, you may manage to hide some latency with software-initiated prefetching, but it's very hard to get it right - you need to start the prefetch several hundred cycles before you are going to be using the data - do it too late and you still get a cache miss, do it too early and your data may get evicted from cache before you are ready to use it. Often this will put the prefetch in some unrelated part of the code, which is bad for modularity and software maintenance. Worse still, if your architecture changes (new CPU, different clock speed, etc), such that DRAM access latency increases or decreases, you may need to move your prefetch instructions to another part of the code to keep them effective.
Anyway, if you feel you really must use prefetch, I recommend #ifdefs around any prefetch instructions so that you can compile your code with and without prefetch and see if it is actually helping (or hindering) performance, e.g.
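As a minimal sketch of that idea (assuming GCC/Clang's __builtin_prefetch and a made-up USE_PREFETCH build flag, neither of which is prescribed by the answer itself), the prefetch can be hidden behind a macro that compiles to nothing when the flag is absent:

    #include <stddef.h>

    /* Hypothetical switch: build once with -DUSE_PREFETCH and once without,
       then compare the two binaries under a profiler. */
    #ifdef USE_PREFETCH
    #  define MAYBE_PREFETCH(addr) __builtin_prefetch(addr)  /* GCC/Clang builtin */
    #else
    #  define MAYBE_PREFETCH(addr) ((void)0)                 /* compiles away */
    #endif

    void process_chunk(const double *chunk, const double *next_chunk, size_t n)
    {
        MAYBE_PREFETCH(next_chunk);      /* hint issued while we are still busy below */
        for (size_t i = 0; i < n; i++) {
            /* ... real work on chunk[i] ... */
        }
    }

Building both variants and comparing the profiler output is the point; the macro itself is just scaffolding.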
In general though, I would recommend leaving software prefetch on the back burner as a last resort micro-optimisation after you've done all the more productive and obvious stuff.
To even consider prefetching, code performance must already be an issue.
1: Use a code profiler. Trying to use prefetch without a profiler is a waste of time.
2: Whenever you find an instruction in a critical place that is anomalously slow, you have a candidate for a prefetch. Often the actual problem is a memory access on a line before the one the profiler flags as slow, rather than the flagged instruction itself. Work out which memory access is causing the problem (not always easy) and prefetch it.
3: Run your profiler again and see if it made any difference. If it didn't, take the prefetch out.
On occasion I have sped up loops by >300% this way. It's generally most effective if you have a loop accessing memory in a non-sequential way.
I disagree completely about it being less useful on modern CPUs; I have found completely the opposite. On older CPUs, prefetching about 100 instructions ahead was optimal; these days I'd put that number closer to 500.
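As a rough illustration of the kind of non-sequential loop this tends to work on (gather_sum, table, keys and PREFETCH_DISTANCE are invented for the example; the distance is exactly the knob you re-profile after changing):

    #include <stddef.h>
    #include <stdint.h>

    /* Non-sequential (gather) access: the hardware prefetcher cannot predict
       table[keys[i]], so a software hint a few iterations ahead can help.
       PREFETCH_DISTANCE is the value to tune with the profiler. */
    #define PREFETCH_DISTANCE 8

    uint64_t gather_sum(const uint64_t *table, const uint32_t *keys, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&table[keys[i + PREFETCH_DISTANCE]], 0, 1);
            sum += table[keys[i]];
        }
        return sum;
    }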
Sure, you have to experiment a bit, but note that you need to issue the fetch some hundred cycles (100-300) before the data is needed. The L2 cache is big enough that the prefetched data can stay there for a while.
This kind of prefetching is very effective in front of a loop (a few hundred cycles ahead, of course), especially if it is the inner loop and the loop is started thousands of times or more per second.
Also, for your own fast linked-list implementation or a tree implementation, prefetching can gain a measurable advantage, because the CPU doesn't yet know that the data will be needed soon.
But remember that prefetch instructions eat some decoder/queue bandwidth, so overusing them hurts performance for that reason.
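A small sketch of the linked-list case mentioned above (the node layout and function are illustrative, and again assume GCC/Clang's __builtin_prefetch):

    #include <stddef.h>

    struct node {
        struct node *next;
        long value;
        char payload[120];    /* illustrative: a node larger than one cache line */
    };

    /* The hardware prefetcher cannot guess where 'next' points, so we hint at
       the following node while the current one is still being processed. */
    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n) {
            if (n->next)
                __builtin_prefetch(n->next);
            sum += n->value;
            /* ... more work on n->payload here would give the prefetch time to land ... */
            n = n->next;
        }
        return sum;
    }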