Intel Core 2 Duo prefetch
Has anyone had experience using prefetch instructions for the Core 2 Duo processor?
I've been using the (standard?) prefetch set (prefetchnta, prefetcht1, etc.) with success on a series of P4 machines, but when running the code on a Core 2 Duo it seems that the prefetcht(i) instructions do nothing, and that the prefetchnta instruction is less effective.
My criterion for assessing performance is the timing of a BLAS 1 vector-vector (axpy) operation, with vectors large enough for out-of-cache behaviour.
Have Intel introduced new prefetch instructions?
From an Intel reference document on Intel 64 and IA-32 Architectures, check out pages 163 and 77:
I don't know whether this is an issue with your code, but consider that the cache line size (which determines the stride to use with prefetch instructions) may vary between processors. If code tuned for one cache line size runs on a CPU with a different one, the prefetches land at the wrong spacing and performance is likely to suffer.
This question here asked how to determine prefetch cache line size.
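As a rough illustration of the point about stride, here is a minimal sketch of querying the line size at run time instead of hard-coding it. It assumes a POSIX system; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension and may be missing or report 0 elsewhere, so the sketch falls back to 64 bytes, which is only a guess. The function names `cache_line_size` and `doubles_per_line` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes.
 * Assumption: glibc's _SC_LEVEL1_DCACHE_LINESIZE is available; if it is
 * not defined, or the system reports nothing useful, fall back to 64. */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return (size_t)n;
#endif
    return 64; /* fallback guess, not a measured value */
}

/* Prefetch stride expressed in array elements: issuing one prefetch
 * per this many doubles touches each cache line once. */
size_t doubles_per_line(void)
{
    size_t line = cache_line_size();
    assert(line >= sizeof(double));
    return line / sizeof(double);
}
```

A loop over doubles would then issue one prefetch every `doubles_per_line()` elements rather than at a stride baked in for a particular CPU.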
I tried this once on a tight loop I was optimizing, which loaded 4 doubles and did about 15 floating point operations per iteration.
I found that to have a positive effect on a Core 2 Duo, the prefetch needed to be issued at least 16 iterations ahead, whereas on older processors 4 iterations ahead was enough.
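To make the "iterations ahead" idea concrete, here is a hedged sketch of an axpy loop with the prefetch distance as a parameter, so values such as 4 and 16 can be timed against each other. It is not the code from my original loop. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin is used here; on x86, locality 3 is generally emitted as prefetcht0 and locality 0 as prefetchnta. The names `axpy_prefetch` and `DOUBLES_PER_ITER` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DOUBLES_PER_ITER 4 /* mirrors the "4 doubles per loop" above */

/* Computes y[i] += a * x[i], prefetching dist_iters iterations ahead.
 * dist_iters is the tuning knob to vary when timing. */
void axpy_prefetch(size_t n, double a, const double *x, double *y,
                   size_t dist_iters)
{
    assert(x != NULL && y != NULL);
    const size_t ahead = dist_iters * DOUBLES_PER_ITER; /* in elements */
    size_t i = 0;

    for (; i + DOUBLES_PER_ITER <= n; i += DOUBLES_PER_ITER) {
        /* Only prefetch addresses that lie inside the arrays. */
        if (i + ahead < n) {
            __builtin_prefetch(&x[i + ahead], 0, 3); /* will be read */
            __builtin_prefetch(&y[i + ahead], 1, 3); /* will be written */
        }
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remainder when n is not a multiple of DOUBLES_PER_ITER. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

Calling `axpy_prefetch(n, a, x, y, 4)` and `axpy_prefetch(n, a, x, y, 16)` on vectors much larger than the cache, and timing each, is one way to look for the best distance on a given machine.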