Finding the best equivalents for prefetch instructions on ia32, ia64, amd64, and powerpc
I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously, the ia32/amd64 code in question is using prefetchnta, not prefetcht1 as it would if that code were consistent with the ia64 implementations (there are #ifdef variations of those in our code for our still-live hpipf port and our now-dead Windows and Linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
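A minimal sketch of what that switch might look like, assuming hypothetical names (touch, sum_with_touch, TOUCH_HINT, TOUCH_USE_SSE) that are not from the code base in question. The temporal hint is kept behind one macro so that either _MM_HINT_NTA or _MM_HINT_T1 can be tried without touching the callers; the prefetch distance of 16 elements is an untuned assumption.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_NTA, _MM_HINT_T1 */
#define TOUCH_USE_SSE 1
#else
#define TOUCH_USE_SSE 0
#endif

/* Hypothetical knob: the temporal hint is the open question, so keep it in
 * one place. Defaults to the hint the existing ia32/amd64 code uses. */
#ifndef TOUCH_HINT
#define TOUCH_HINT _MM_HINT_NTA /* or _MM_HINT_T1 */
#endif

/* Hypothetical wrapper: prefetch the cache line containing pTouch. */
static inline void touch(const void *pTouch)
{
#if TOUCH_USE_SSE
    _mm_prefetch((const char *)pTouch, TOUCH_HINT);
#else
    (void)pTouch; /* no-op where the SSE intrinsic is unavailable */
#endif
}

/* Example caller: sums an array, prefetching a fixed distance ahead. */
static long sum_with_touch(const int *data, size_t n)
{
    const size_t ahead = 16; /* assumed distance, not tuned */
    long total = 0;
    assert(n == 0 || data != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            touch(&data[i + ahead]); /* stay inside the array bounds */
        total += data[i];
    }
    return total;
}
```

Building with -DTOUCH_HINT=_MM_HINT_T1 would switch every call site at once, which makes it easier to compare the two hints on the same workload.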
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Comments (2)
Some systems support the prefetchw instruction for writes.
If the line is used exclusively by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of the prefetchw mentioned above is that it will bring the line in and give you ownership of it, which may take a while if the line was also used by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long ahead of the actual access and don't want the prefetch to get lost in that time, or alternatively if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
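As a rough illustration of that orthogonality, here is a sketch using the GCC/Clang builtin __builtin_prefetch(addr, rw, locality), where rw selects read (0) or write (1) intent and locality (0 to 3) is a separate argument. The wrapper and function names are hypothetical. Whether rw=1 actually becomes a prefetchw depends on the compiler and target options (on GCC, for example, enabling -mprfchw); otherwise the compiler may emit an ordinary prefetch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: read intent, high temporal locality. */
static inline void prefetch_read(const void *p)
{
#if defined(__GNUC__)
    __builtin_prefetch(p, 0, 3);
#else
    (void)p; /* no-op on compilers without the builtin */
#endif
}

/* Hypothetical wrapper: write intent, same locality level.
 * Only the rw argument differs from prefetch_read. */
static inline void prefetch_write(const void *p)
{
#if defined(__GNUC__)
    __builtin_prefetch(p, 1, 3);
#else
    (void)p;
#endif
}

/* Example caller: modifies every element, so it asks for write intent. */
static void increment_all(int *data, size_t n)
{
    const size_t ahead = 16; /* assumed distance, not tuned */
    assert(n == 0 || data != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            prefetch_write(&data[i + ahead]);
        data[i] += 1;
    }
}
```

The locality argument could be lowered independently (for example to 0, the non-temporal end of the range) without changing the read or write intent.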
Just speculating: perhaps the larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact through the non-temporal hint. Consider what happens if your prefetcher is suddenly set loose to fetch anything it can: you'd end up swamped in junk prefetches that would throw away lots of useful cache lines. The NTA hint makes them evict each other, leaving the rest undamaged.
Of course this may also be just a bug. I can't tell for sure (only whoever developed the compiler could), but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part, on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be the non-temporal aligned ones, where a write can bypass the cache, but as far as I can tell a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could have been making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long-term control flow patterns, and so on, and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here, I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless of what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?
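One way to start answering that is to measure. Below is a minimal timing sketch, assuming hypothetical names (sum_plain, sum_prefetched, time_sum, PREFETCH_DISTANCE) and the GCC/Clang builtin __builtin_prefetch. It only shows the shape of a comparison: a real measurement would need a working set larger than the cache, repeated runs, and ideally hardware cache-miss counters.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 64 /* elements ahead; assumed, tune by measurement */
#endif

/* Baseline: no software prefetch. */
static long sum_plain(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Same loop with a read prefetch issued a fixed distance ahead. */
static long sum_prefetched(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
#endif
        total += data[i];
    }
    return total;
}

/* Runs fn once over the data, stores its result in *out, and returns the
 * CPU time taken in seconds as reported by clock(). */
static double time_sum(long (*fn)(const int *, size_t),
                       const int *data, size_t n, long *out)
{
    clock_t start;
    assert(fn != NULL && out != NULL);
    start = clock();
    *out = fn(data, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum with sum_plain and then sum_prefetched on the same buffer gives two timings to compare; both variants must return the same sum, which is a cheap sanity check that the prefetch did not change behavior.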