Intel Core 2 Duo prefetching

Posted 2024-08-12 04:04:47 · 309 words · 11 views · 0 comments

Has anyone had experience using prefetch instructions for the Core 2 Duo processor?

I've been using the (standard?) prefetch instructions (prefetchnta, prefetcht1, etc.) successfully on a series of P4 machines, but when I run the same code on a Core 2 Duo, the prefetcht(i) instructions seem to do nothing and prefetchnta is less effective than it was on the P4.

My criterion for assessing performance is the timing of a BLAS 1 vector-vector (axpy) operation when the vectors are large enough to give out-of-cache behaviour.
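For readers who want something concrete to time, here is a minimal sketch of the kind of kernel being described: an axpy loop with software prefetch hints. It is not the asker's code. It assumes GCC or Clang (for `__builtin_prefetch`), a 64-byte cache line, and a made-up tuning constant `PREFETCH_AHEAD`. On x86, GCC typically lowers locality 0 to prefetchnta and higher locality values to the prefetcht(i) forms, but that mapping is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Assumed cache line of 64 bytes, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* y[i] += a * x[i], with one prefetch hint per (assumed) cache line
 * on each of the two streams. */
void daxpy_prefetch(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    for (size_t i = 0; i < n; ++i) {
        /* Only hint once per line, and never past the end of the arrays. */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 0); /* read,  non-temporal */
            __builtin_prefetch(&y[i + PREFETCH_AHEAD], 1, 0); /* write, non-temporal */
        }
        y[i] += a * x[i];
    }
}
```

The hints are advisory only, so the result is the same with or without them; any difference shows up purely in the timings.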

Has Intel introduced new prefetch instructions?

Comments (3)

原来分手还会想你 2024-08-19 04:04:47

From an Intel reference document on the Intel 64 and IA-32 architectures (see pages 163 and 77):

Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently to fetch data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetching mechanisms. Earlier implementations of hardware prefetching mechanisms focus on prefetching data and instruction from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1. In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.

The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor’s PREFETCHNTA instruction also fetches 64-bytes into the first-level data cache without polluting the second-level cache.

Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than Pentium M processors. Key differences are summarized in Table 2-10.

情归归情 2024-08-19 04:04:47

I don't know whether it might be an issue with your code, but consider that the cache line size (which determines the stride size for use with prefetch instructions) may vary between different processors. Therefore, if you use code which is optimised under the assumption of a different cache line size on a CPU where this assumption isn't met, it's bound to deteriorate performance.

This question here asked how to determine prefetch cache line size.
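As a rough illustration of the point about line size, one way to avoid hard-coding the stride is to ask the OS at run time. The sketch below assumes a POSIX system; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension that is not available everywhere, so the code falls back to 64 bytes, which is only a guess. The function name `cache_line_size` is made up for this example.

```c
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes if the platform reports it,
 * otherwise an assumed fallback of 64 bytes. */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (sz > 0) {
        return (size_t)sz;
    }
#endif
    return 64; /* assumption: typical for recent x86 parts, not guaranteed */
}
```

The prefetch stride in elements is then `cache_line_size() / sizeof(double)` for a vector of doubles, so one hint is issued per line rather than per element.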

挽袖吟 2024-08-19 04:04:47

I tried this once on a tight loop I was optimizing, which loaded 4 doubles and did about 15 floating-point operations per iteration.
I found that, to get any benefit on a Core 2 Duo, the prefetch had to target data at least 16 iterations ahead, whereas on older processors 4 iterations ahead was enough.
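Since the useful distance apparently differs between processors, it may be easier to make it a run-time parameter and sweep it while timing. This is a hedged sketch, not the loop from the comment above: it assumes GCC or Clang for `__builtin_prefetch`, the function name `sum4_prefetch` is invented, and the arithmetic is a stand-in for the real floating-point work.

```c
#include <assert.h>
#include <stddef.h>

/* Processes 4 doubles per iteration and issues a prefetch hint for the
 * data that will be used `ahead` iterations later. */
double sum4_prefetch(const double *data, size_t iters, size_t ahead)
{
    assert(iters == 0 || data != NULL);

    double acc = 0.0;
    for (size_t i = 0; i < iters; ++i) {
        /* Skip the hint near the end so we never form an out-of-range address. */
        if (i + ahead < iters) {
            __builtin_prefetch(&data[4 * (i + ahead)], 0, 3); /* read, keep in cache */
        }
        const double *p = &data[4 * i];
        acc += p[0] * p[1] + p[2] * p[3]; /* placeholder for the real computation */
    }
    return acc;
}
```

Timing calls such as `sum4_prefetch(buf, n, 4)` against `sum4_prefetch(buf, n, 16)` on a buffer much larger than the cache would show whether the longer distance helps on a given machine.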
