Prefetching aligned memory
I have some threaded C code that requires 64-byte alignment of the data structure it processes. How will this alignment interact with prefetch instructions such as GCC's __builtin_prefetch? Will prefetching have the same effect as it would on a non-aligned array?
Note that I am using memalign to obtain the aligned array.
Thanks.
Comments (1)
The answer to this one is highly implementation-dependent.
However, on x86 and x86_64, GCC implements __builtin_prefetch as a single PREFETCH assembly instruction. According to Intel's documentation (search for "PREFETCH"):
I am 99% sure the AMD version behaves the same way, but I am too busy to check...
So if the memory operand is unaligned, it is effectively rounded down to a multiple of 64 bytes, and the cache line at that address is prefetched. (Well, 64 bytes on all the current CPUs I know of. The instruction set reference only guarantees "a minimum of 32 bytes". Not sure why they bothered saying that; in any situation where it makes sense to use this gadget, you already have to be assuming a lot about the particular CPU.)
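To make the rounding concrete, here is a minimal sketch in C. It assumes a 64-byte cache line, which the instruction set reference does not guarantee. The names CACHE_LINE, line_base, same_line and sum_with_prefetch are made up for illustration and do not come from GCC or Intel. Only __builtin_prefetch(addr, rw, locality) is the real GCC builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Common on x86, but not guaranteed. */
#define CACHE_LINE 64u

/* Hypothetical helper: start address of the cache line containing p,
   i.e. p rounded down to a multiple of CACHE_LINE. */
static uintptr_t line_base(const void *p)
{
    return (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Hypothetical helper: nonzero if a and b fall in the same cache line,
   in which case prefetching either address targets the same line. */
static int same_line(const void *a, const void *b)
{
    return line_base(a) == line_base(b);
}

/* Sum n bytes, issuing one read prefetch hint per line for the line
   after the one currently being read. */
static unsigned long sum_with_prefetch(const unsigned char *data, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint only at multiples of CACHE_LINE, and only while the
           target address is still inside the buffer. */
        if (i % CACHE_LINE == 0 && i + CACHE_LINE < n)
            __builtin_prefetch(data + i + CACHE_LINE, 0, 3);
        total += data[i];
    }
    return total;
}
```

If the buffer comes from memalign(64, ...), as in the question, data itself is line-aligned, so each hint in the loop lands exactly on a line start. With an unaligned buffer the hints still work: each one is rounded down to the line containing its address, so the only difference is which bytes of the array share a line.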