每条汇编指令需要多少个CPU周期?

发布于 2024-07-16 04:08:15 字数 724 浏览 13 评论 0原文

我听说英特尔在线书籍描述了特定汇编指令所需的 CPU 周期,但我无法找到它(经过努力后)。 谁能告诉我如何找到CPU周期?

这是一个例子,在下面的代码中,mov/lock 是 1 个 CPU 周期,xchg 是 3 个 CPU 周期。

// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, 
                                              int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}

#endif // WIN32

顺便说一句:这是我发布的代码的 URL: http://www.codeproject.com /KB/threads/spinlocks.aspx

I heard there is Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I can not find it out (after trying hard). Could anyone show me how to find CPU cycle please?

Here is an example, in the below code, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.

// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, 
                                              int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}

#endif // WIN32

BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

゛清羽墨安 2024-07-23 04:08:15

现代CPU是复杂的野兽,使用流水线超标量执行,以及 乱序执行以及其他使性能分析变得困难的技术......但并非不可能

虽然您不能再简单地将指令流的延迟相加来获得总运行时间,但您仍然可以(通常)对某些代码段(尤其是循环)的行为进行高度准确的分析,如下文和中所述其他链接资源。

指令时序

首先,您需要实际时序。 这些因 CPU 架构而异,但目前 x86 计时的最佳资源是 Agner Fog 的指令表。 这些表涵盖了不少于三十种不同的微体系结构,列出了指令延迟,这是一条指令从准备好输入到可用输出所需的最短/典型时间。 用阿格纳的话说:

延迟:这是指令在一个指令中生成的延迟。
依赖链。 这些数字是最小值。 缓存未命中,
未对准和异常可能会增加时钟计数
相当。 在启用超线程的情况下,使用相同的
其他线程中的执行单元会导致性能较差。
非正规数、NAN 和无穷大不会增加延迟。 这
使用的时间单位是核心时钟周期,而不是参考时钟周期
由时间戳计数器给出。

例如,add 指令的延迟为一个周期,因此一系列相关添加指令(如图所示)每个的延迟为 1 个周期>add

add eax, eax
add eax, eax
add eax, eax
add eax, eax  # total latency of 4 cycles for these 4 adds

请注意,这并不意味着 add 指令每条仅需要 1 个周期。 例如,如果加法指令依赖,则在现代芯片上,所有 4 个加法指令可能可以在同一个周期中独立执行:

add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute, in parallel in a single cycle

Agner 提供了一个度量来捕获一些这种潜在的并行性,称为吞吐量倒数

吞吐量倒数:一系列同类独立指令每条指令的平均核心时钟周期数
在同一个线程中。

对于 add,这被列为 0.25,这意味着每个周期最多可以执行 4 条 add 指令(吞吐量的倒数为 1 / 4 = 0.25)。

吞吐量倒数还暗示了指令的流水线能力。 例如,在最新的 x86 芯片上,imul 指令的常见形式有 3 个周期的延迟,并且内部只有一个执行单元可以处理它们(与 add 不同,通常有四个可添加单元)。 然而,观察到的一长串独立 imul 指令的吞吐量是 1 个/周期,而不是您可能期望的每 3 个周期 1 个(考虑到延迟为 3)。原因是 imul code> 单元是流水线式的:它可以启动一个新的 imul 每个周期,即使之前的乘法尚未完成。

这意味着一系列独立 imul 指令每个周期最多可以运行 1 条指令,但是一系列独立 imul 指令> 指令每 3 个周期仅运行 1 次(因为下一个 imul 在前一个指令的结果准备就绪之前无法启动)。

因此,有了这些信息,您就可以开始了解如何分析现代 CPU 上的指令时序。

详细分析

不过,以上只是触及了表面。 您现在可以通过多种方式查看一系列指令(延迟或吞吐量),但可能不清楚使用哪一种。

此外,还有上述数字未涵盖的其他限制,例如某些指令竞争 CPU 内的相同资源,以及 CPU 管道其他部分(例如指令解码)的限制,这可能会导致较低的性能。总吞吐量比您仅通过查看延迟和吞吐量来计算的要高。 除此之外,您还有“ALU 之外”的因素,例如内存访问和分支预测:整个主题本身 - 您基本上可以很好地对这些进行建模,但这需要工作。 例如,这是最近的帖子,其中答案详细介绍了大多数相关因素。

涵盖所有细节会使这个已经很长的答案的大小增加 10 倍或更多,所以我只会向您指出最好的资源。 Agner Fog 有一个优化装配指南,其中详细介绍了精确的分析具有十几个指令的循环。 请参阅当前版本 PDF 中从第 95 页开始的“12.7向量循环瓶颈分析示例”。

基本思想是创建一个表,每条指令一行,并标记每个指令使用的执行资源。 这可以让您看到任何吞吐量瓶颈。 此外,您需要检查循环中是否存在依赖项,以查看其中是否有任何限制吞吐量(有关复杂情况,请参阅“12.16分析依赖项”)。

如果您不想手动完成,英特尔已经发布了 英特尔架构代码分析器,这是一个自动执行此分析的工具。 目前,Skylake 之外尚未进行更新,但对于 Kaby Lake 来说,结果在很大程度上仍然合理,因为微架构没有发生太大变化,因此时间仍然具有可比性。 这个答案提供了很多细节并提供了示例输出,以及用户指南 还不错(尽管它已经过时了)相对于最新版本的日期)。

其他来源

Agner 通常会在新架构发布后不久提供其时序,但您也可以查看 instlatx64 以了解类似信息在 InstLatX86InstLatX64 结果中组织计时。 结果涵盖了许多有趣的旧芯片,而新芯片通常很快就会出现。 结果大部分与阿格纳的结果一致,但也有一些例外。 您还可以在此页面上找到内存延迟和其他值。

您甚至可以直接从英特尔的 附录 C:指令延迟和吞吐量中的 IA32 和 Intel 64 优化手册。 就我个人而言,我更喜欢 Agner 的版本,因为它们更完整,通常在英特尔手册更新之前到达,并且由于提供电子表格和 PDF 版本而更易于使用。

最后,x86 tag wiki 拥有丰富的 x86 优化资源,包括指向如何进行循环的其他示例的链接准确分析代码序列。

如果您想更深入地了解上述“数据流分析”类型,我建议您数据流图旋风简介

Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!

While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get a (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.

Instruction Timings

First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitecures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:

Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.

So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:

add eax, eax
add eax, eax
add eax, eax
add eax, eax  # total latency of 4 cycles for these 4 adds

Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:

add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute, in parallel in a single cycle

Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.

For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).

The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.

This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).

So with this information, you can start to see how to analyze instruction timings on modern CPUs.

Detailed Analysis

Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.

Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.

Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Asembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.

The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).

If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).

Other sources

Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.

You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.

Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.

If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.

有深☉意 2024-07-23 04:08:15

考虑到流水线、乱序处理、微代码、多核处理器等,无法保证汇编代码的特定部分将恰好占用 x 个 CPU 周期/时钟周期/任何周期。

如果存在这样的参考,它只能在给定特定架构的情况下提供广泛的概括,并且根据微代码的实现方式,您可能会发现 Pentium M 与 Core 2 Duo 不同,而 Core 2 Duo 与 AMD 双核不同等等。

注意,这篇文章是2000年更新的,写得更早。 即使是 Pentium 4 也很难确定指令时序 - PIII、PII 和原始 Pentium 更容易,并且引用的文本可能基于那些具有更明确指令时序的早期处理器。

如今,人们通常使用统计分析来估计代码时序。

Given pipelining, out of order processing, microcode, multi-core processors, etc there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycle/whatever cycles.

If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.

Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - PIII, PII, and the original pentium were easier, and the texts referenced were probably based on those earlier processors that had a more well-defined instruction timing.

These days people generally use statistical analysis for code timing estimation.

时光是把杀猪刀 2024-07-23 04:08:15

其他答案所说的不可能准确预测在现代 CPU 上运行的代码的性能是正确的,但这并不意味着延迟是未知的,或者知道它们是无用的。

Agner Fog 的指令表中列出了 Intel 和 AMD 处理器的确切延迟。 另请参阅英特尔® 64 和 IA-32 架构优化参考手册,以及指令延迟和吞吐量适用于 AMD 和 Intel x86 处理器(来自 Can Berk Güder 现已删除的仅链接答案)。 AMD 还在自己的网站上提供了包含官方值的 pdf 手册。

对于(微)优化紧密循环,了解每条指令的延迟对于手动尝试安排代码有很大帮助。 程序员可以做很多编译器做不到的优化(因为编译器不能保证不会改变程序的含义)。

当然,这仍然需要你了解CPU的很多其他细节,比如它的流水线有多深、每个周期可以发出多少条指令、执行单元的数量等等。 当然,这些数字因不同的 CPU 而异。 但您通常可以得出一个或多或少适用于所有 CPU 的合理平均值。

但值得注意的是,在这个级别优化几行代码也需要大量工作。 而且很容易做出一些被证明是悲观的事情。 现代 CPU 非常复杂,并且它们非常努力地从糟糕的代码中获得良好的性能。 但也有一些情况他们无法有效处理,或者您认为自己很聪明并编写了高效的代码,但结果却降低了 CPU 的速度。

编辑
查看Intel的优化手册,表C-13:
第一列是指令类型,然后有许多列表示每个 CPUID 的延迟。 CPUID 指示数字适用于哪个处理器系列,并在文档的其他地方进行了解释。 延迟指定指令结果可用之前需要多少个周期,因此这就是您要查找的数字。

吞吐量列显示每个周期可以执行多少此类指令。

在此表中查找 xchg,我们发现根据 CPU 系列的不同,它需要 1-3 个周期,而一个 mov 需要 0.5-1 个周期。 这些是针对寄存器到寄存器形式的指令,而不是针对内存的lock xchg,后者要慢很多。 更重要的是,巨大变化的延迟和对周围代码的影响(当与另一个核心发生争用时速度会慢得多),因此只考虑最好的情况是错误的。 (我没有查过每个 CPUID 的含义,但我假设 .5 是针对 Pentium 4,它以双倍速度运行芯片的某些组件,允许它在半个周期内完成任务)

我真的不明白什么但是,您计划使用此信息,但如果您知道代码运行的确切 CPU 系列,那么将延迟相加即可得知执行此指令序列所需的最小周期数。

What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.

The exact latencies for Intels and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has pdf manuals on their own website with their official values.

For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).

Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.

It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.

Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.

The throughput columns show how many of this type of instructions can be executed per cycle.

Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. And more importantly, hugely-variable latency and impact on surrounding code (much slower when there's contention with another core), so looking only at the best-case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 are for Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles)

I don't really see what you plan to use this information for, however, but if you know the exact CPU family the code is running on, then adding up the latency tells you the minimum number of cycles required to execute this sequence of instructions.

荆棘i 2024-07-23 04:08:15

测量和计算 CPU 周期在 x86 上不再有意义。

首先,问问自己正在计算哪个 CPU 的周期? 核心2? 速龙? 奔腾-M? 原子? 所有这些 CPU 都执行 x86 代码,但它们都有不同的执行时间。 甚至同一 CPU 的不同步进之间的执行也会有所不同。

最后一个对周期计数有意义的 x86 是 Pentium-Pro。

还要考虑到,在 CPU 内部,大多数指令都被转码为微码,并由内部执行单元乱序执行,而该内部执行单元看起来根本不像 x86。 单个CPU指令的性能取决于内部执行单元有多少资源可用。

因此一条指令的时间不仅取决于指令本身,还取决于周围的代码。

无论如何:您可以估计不同处理器的吞吐量资源使用情况和指令延迟。 相关信息可以在Intel和AMD网站上找到。

Agner Fog 在他的网站上有一个非常好的总结。 请参阅指令表了解延迟、吞吐量和 uop 计数。 请参阅微架构 PDF 以了解如何解释它们。

http://www.agner.org/optimize

但请注意 xchg -with-memory 没有可预测的性能,即使您只查看一种 CPU 型号。 即使在 L1D 缓存中缓存行已经很热的无争用情况下,作为完整的内存屏障也意味着它的影响在很大程度上取决于加载和存储到周围代码中的其他地址。


顺便说一句 - 因为您的示例代码是无锁数据结构基本构建块:您是否考虑过使用编译器内置函数? 在 win32 上,您可以包含 intrin.h 并使用 _InterlockedExchange 等函数。

这将为您提供更好的执行时间,因为编译器可以内联指令。 内联汇编器始终强制编译器禁用 asm 代码周围的优化。

Measuring and counting CPU-cycles does not make sense on the x86 anymore.

First off, ask yourself for which CPU you're counting cycles? Core-2? a Athlon? Pentium-M? Atom? All these CPUs execute x86 code but all of them have different execution times. The execution even varies between different steppings of the same CPU.

The last x86 where cycle-counting made sense was the Pentium-Pro.

Also consider, that inside the CPU most instructions are transcoded into microcode and executed out of order by a internal execution unit that does not even remotely look like a x86. The performance of a single CPU instruction depends on how much resources in the internal execution unit is available.

So the time for a instruction depends not only on the instruction itself but also on the surrounding code.

Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.

Agner Fog has a very nice summary on his web-site. See the instruction tables for latency, throughput, and uop count. See the microarchictecture PDF to learn how to interpret those.

http://www.agner.org/optimize

But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache-line already hot in L1D cache, being a full memory barrier will mean it's impact depends a lot on loads and stores to other addresses in the surrounding code.


Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.

That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.

自此以后,行同陌路 2024-07-23 04:08:15

lock xchg eax, dword ptr [edx]

请注意,锁将锁定所有内核的内存获取内存,这在某些多核上可能需要 100 个周期,并且还需要刷新高速缓存行。 它还将使管道停顿。 所以我不会担心其余的事情。

因此,最佳性能又回到了调整算法关键区域。

请注意,在单核上,您可以通过删除锁来优化它,但多核需要它。

lock xchg eax, dword ptr [edx]

Note the lock will lock memory for the memory fetch for all cores, this can take 100 cycles on some multi cores and a cache line will also need to be flushed. It will also stall the pipeline. So i wouldnt worry about the rest.

So optimal performance gets back to tuning your algorithms critical regions.

Note on a single core you can optmize this by removing the lock but it is needed for multi core.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文