Can CPU out-of-order execution cause memory reordering?
I know that store buffers and invalidate queues are causes of memory reordering. What I don't know is whether out-of-order execution can cause memory reordering.
In my opinion, out-of-order execution can't cause reordering, because the results are always retired in order, as mentioned in this question.
To make my question clearer, let's say we have a relaxed memory consistency architecture like this:
- It has no store buffer or invalidate queues
- It can do out-of-order execution
Can memory reordering still happen in this architecture?
Does a memory barrier have two functions, one being to forbid out-of-order execution, the other being to flush the invalidate queue and drain the store buffer?
Comments (1)
Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.
It is not so much a question of the loads being retired in order as of when the load value is bound to the load instruction. E.g., Load1 may precede Load2 in program order, yet Load2 gets its value from memory before Load1 does. If there is an intervening store to the location read by Load2, then load/load reordering has occurred.
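To make the Load1/Load2 case concrete, here is a minimal sketch (a toy model, not a description of any real CPU). It assumes a second processor that stores to `data` and then to `flag`, with both stores becoming visible in that order, and a reader whose Load1 reads `flag` and whose Load2 reads `data`. The names `WRITER`, `IN_ORDER`, `OUT_OF_ORDER`, `interleavings` and `outcomes` are made up for this illustration. The sketch enumerates every interleaving and collects the values the reader can end up with.

```python
from itertools import combinations

# Hypothetical writer on another processor: stores become visible in this order.
WRITER = [("store", "data", 1), ("store", "flag", 1)]

# Reader, program order: Load1 reads flag, Load2 reads data.
# Each list gives the order in which the loads are *bound* to a value.
IN_ORDER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]
OUT_OF_ORDER = [("load", "data", "r_data"), ("load", "flag", "r_flag")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in positions else next(ib) for i in range(n)]


def outcomes(reader_bind_order):
    """Return the set of (r_flag, r_data) pairs the reader can observe."""
    results = set()
    for schedule in interleavings(WRITER, reader_bind_order):
        mem = {"data": 0, "flag": 0}
        regs = {}
        for op, loc, arg in schedule:
            if op == "store":
                mem[loc] = arg          # arg is the value being stored
            else:
                regs[arg] = mem[loc]    # arg is the destination register
        results.add((regs["r_flag"], regs["r_data"]))
    return results


print(sorted(outcomes(IN_ORDER)))
print(sorted(outcomes(OUT_OF_ORDER)))
```

In this model the pair `(1, 0)` (new `flag`, stale `data`) shows up only when Load2 is bound before Load1, even though nothing here models a store buffer or an invalidate queue.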
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a possible store to the location of such a buffered but not yet retired load is detected, then the load and the instructions after it in program order are “nuked”, and execution is resumed at, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
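The detect-and-repair idea can be sketched with a toy model. This is only an illustration under stated assumptions, not how P6 hardware is actually built: loads bind out of order (Load2 first), stay in a buffer until they retire in program order, and a store from the other processor that hits a buffered load causes that load and all younger buffered loads to be re-executed immediately. All names (`PROGRAM`, `BIND_ORDER`, `WRITER`, `schedules`, `run`, `outcomes`) are hypothetical.

```python
from itertools import combinations

PROGRAM = [("r_flag", "flag"), ("r_data", "data")]  # reader loads, program order
BIND_ORDER = [1, 0]                                  # Load2 (index 1) binds first
WRITER = [("data", 1), ("flag", 1)]                  # other processor's stores


def schedules():
    """Yield every interleaving of writer stores and reader load bindings."""
    w = [("W", i) for i in range(len(WRITER))]
    r = [("R", idx) for idx in BIND_ORDER]
    n = len(w) + len(r)
    for pos in combinations(range(n), len(w)):
        iw, ir = iter(w), iter(r)
        yield [next(iw) if i in pos else next(ir) for i in range(n)]


def run(schedule, snoop):
    mem = {"data": 0, "flag": 0}
    buffered = {}    # program index -> value; bound but not yet retired
    retired = {}     # register name -> architecturally committed value
    next_retire = 0  # retirement is strictly in program order
    for kind, i in schedule:
        if kind == "W":
            addr, value = WRITER[i]
            mem[addr] = value
            if snoop:
                hits = [idx for idx in buffered if PROGRAM[idx][1] == addr]
                if hits:
                    oldest = min(hits)
                    # "Nuke" the hit load and everything younger, then replay.
                    # Simplification: the replay happens immediately.
                    for idx in [j for j in buffered if j >= oldest]:
                        buffered[idx] = mem[PROGRAM[idx][1]]
        else:
            buffered[i] = mem[PROGRAM[i][1]]
        # Retire from the head once the oldest unretired load is bound.
        while next_retire in buffered:
            retired[PROGRAM[next_retire][0]] = buffered.pop(next_retire)
            next_retire += 1
    return retired["r_flag"], retired["r_data"]


def outcomes(snoop):
    return {run(s, snoop) for s in schedules()}


print(sorted(outcomes(snoop=False)))
print(sorted(outcomes(snoop=True)))
```

Without the snoop check, in-order retirement alone does not help: the stale `data` value bound early by Load2 is still what retires. With the check, the model produces only the outcomes that in-order binding would allow.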
Note also that this requires something like snoop-based cache coherence. Many systems, including Intel systems that have such mechanisms, also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but noncoherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said “buffer until retirement”, but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have themselves been bound, and there is no longer any possibility of an intervening store being observed, even transitively.
This can be important, because there is quite a lot of performance to be gained by “early retirement”, i.e. removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out-of-order hardware mechanisms.