Performance of modern processors

Posted on 2024-12-23 10:04:15

When executed on a modern processor (AMD Phenom II 1090T), how many clock ticks is the following code more likely to consume: 3 or 11?

label:  mov (%rsi), %rax
        adc %rax, (%rdx)
        lea 8(%rdx), %rdx
        lea 8(%rsi), %rsi
        dec %ecx
        jnz label

The problem is that when I execute many iterations of this code, the results come out near either 3 or 11 ticks per iteration, varying from run to run, and I can't decide which figure is the real one.

UPD
According to the table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on the AMD K10 microarchitecture. Therefore, the seemingly impossible 3 ticks per iteration must be caused by bugs in the measurement.

SOLVED
@Atom noticed that the clock frequency isn't constant in modern processors. When I disabled three options in the BIOS (Core Performance Boost, AMD C1E Support, and AMD K8 Cool&Quiet Control), the time consumed by my "six instructions" stabilized at 3 clock ticks :-)
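
To make the effect concrete, here is a minimal sketch of the arithmetic, assuming the measurement relied on a timer that ticks at a fixed rate while the core clock was scaled down by power management. The function name and both frequencies are hypothetical values chosen for illustration, not readings taken from the 1090T.

```python
def core_cycles_per_iteration(timer_ticks, iterations, core_hz, timer_hz):
    """Convert elapsed fixed-rate timer ticks into core clock cycles per loop iteration."""
    # elapsed seconds = timer_ticks / timer_hz; core cycles = seconds * core_hz
    return timer_ticks * core_hz / (timer_hz * iterations)


# Hypothetical numbers: the timer ticks at 3.2 GHz, the core is throttled to 0.8 GHz.
# The timer then reports 12 ticks per iteration although the core spent only 3 cycles.
throttled = core_cycles_per_iteration(12_000_000, 1_000_000, 800_000_000, 3_200_000_000)

# Same loop with the core running at the timer's rate: 3 ticks per iteration.
full_speed = core_cycles_per_iteration(3_000_000, 1_000_000, 3_200_000_000, 3_200_000_000)

print(throttled, full_speed)  # 3.0 3.0
```

Under these assumptions the loop costs the same number of core cycles in both runs; only the tick count changes, which is why fixing the core frequency makes the readings stable.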

誰認得朕 2024-12-30 10:04:15

I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.

(Note that this is for processors in general and I make no references specific to AMD processors.)

Key Concepts:

Most modern (non-embedded) processors today are both superscalar and out-of-order. Not only can they execute multiple (independent) instructions in parallel, but they can also re-order instructions to break dependencies and such.

Let's break down your example:

label:
    mov (%rsi), %rax
    adc %rax, (%rdx)
    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx
    jnz label

The first thing to notice is that the last 3 instructions before the branch are all independent:

    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx

So it's possible for a processor to execute all 3 of these in parallel.

Another thing is this:

adc %rax, (%rdx)
lea 8(%rdx), %rdx

There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is a false dependency (write-after-read): the second instruction doesn't actually depend on the output of the first. Modern processors can rename the rdx register, which allows these two instructions to be re-ordered or executed in parallel.

The same applies to the rsi register between:

mov (%rsi), %rax
lea 8(%rsi), %rsi
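
The renaming idea for the two pairs above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the `rename` function and the `p0`, `p1`, ... physical register names are made up here, and memory operands and flags are ignored entirely.

```python
def rename(instructions, arch_regs):
    """Give every register write a fresh physical register (toy register renaming)."""
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for text, reads, writes in instructions:
        # Sources are looked up first, so they see the mapping from before this write.
        phys_reads = [mapping[reg] for reg in reads]
        phys_writes = []
        for reg in writes:
            mapping[reg] = f"p{next_phys}"  # a new physical register for each write
            next_phys += 1
            phys_writes.append(mapping[reg])
        renamed.append((text, phys_reads, phys_writes))
    return renamed


# (instruction text, registers read, registers written); flags and memory are left out
body = [
    ("mov (%rsi), %rax", ["rsi"], ["rax"]),
    ("adc %rax, (%rdx)", ["rax", "rdx"], []),
    ("lea 8(%rdx), %rdx", ["rdx"], ["rdx"]),
    ("lea 8(%rsi), %rsi", ["rsi"], ["rsi"]),
]

renamed = rename(body, ["rax", "rdx", "rsi"])
for text, reads, writes in renamed:
    print(f"{text:20} reads {reads} writes {writes}")
```

In this model adc reads the old physical register behind rdx while the first lea writes a brand-new one, so neither has to wait for the other. The only link that remains is the true one: adc needs the rax value produced by mov.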

So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):

1:   mov (%rsi), %rax        lea 8(%rdx), %rdx        lea 8(%rsi), %rsi
2:   adc %rax, (%rdx)        dec %ecx
3:   jnz label

*Of course, I'm oversimplifying here. In reality, the latencies are probably longer, and different iterations of the loop overlap with each other.
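
The ordering above can be reproduced with a small toy scheduler. This is a sketch under assumptions that are mine rather than properties of any real core: every instruction takes one cycle, at most three instructions start per cycle (matching the three columns above), the oldest ready instruction goes first, and only true dependencies within a single iteration are tracked.

```python
ISSUE_WIDTH = 3  # assumption: at most three instructions start per cycle

# (instruction text, index of the earlier instruction whose result it needs, or None)
LOOP_BODY = [
    ("mov (%rsi), %rax", None),
    ("adc %rax, (%rdx)", 0),      # needs rax from mov
    ("lea 8(%rdx), %rdx", None),
    ("lea 8(%rsi), %rsi", None),
    ("dec %ecx", None),
    ("jnz label", 4),             # needs the flags from dec
]


def schedule(body, width=ISSUE_WIDTH):
    """Return the 1-based cycle in which each instruction starts (unit latency)."""
    issued_in = [0] * len(body)   # 0 means "not issued yet"
    cycle = 0
    while 0 in issued_in:
        cycle += 1
        slots = width
        for i, (_, dep) in enumerate(body):
            if slots == 0:
                break
            if issued_in[i]:
                continue
            # Ready if it has no producer, or the producer started in an earlier cycle.
            if dep is None or 0 < issued_in[dep] < cycle:
                issued_in[i] = cycle
                slots -= 1
    return issued_in


cycles = schedule(LOOP_BODY)
for (text, _), cycle in zip(LOOP_BODY, cycles):
    print(cycle, text)
print("cycles for one iteration:", max(cycles))  # 3
```

With these settings the model places mov and both lea instructions in cycle 1, adc and dec in cycle 2, and jnz in cycle 3. It says nothing about real latencies or about the overlap between iterations.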

In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
