How can executing more instructions make code run faster?

Posted 2024-12-08 09:55:10

When I run the following function, I get somewhat unexpected results.

On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the ";dec [variable + 24]" line, so that more code is executed, it takes only about 4.5 seconds to run. Why?

.DATA
variable dq 0 dup(4)
.CODE             

runAssemblyCode PROC
    mov rax, 2330 * 1000 * 1000
start:
    dec [variable]
    dec [variable + 8]
    dec [variable + 16]
    ;dec [variable + 24]
    dec rax
    jnz start
    ret 
runAssemblyCode ENDP 
END

I have noticed that there are similar questions on Stack Overflow already, but their code samples are not as simple as this one, and I couldn't find a succinct answer to this question.

I have tried padding the code with nop instructions to see whether it is an alignment problem, and I have also set the process affinity to a single processor. Neither made any difference.

Comments (3)

窝囊感情。 2024-12-15 09:55:10

The simple answer is that modern CPUs are extremely complex. A lot goes on under the hood that appears unpredictable or random to an observer.

Inserting that extra instruction might cause the CPU to schedule instructions differently, which, in a tight loop like this, might make a difference. But that's just a guess.

As far as I can see, it touches the same cache line as the previous instruction, so it doesn't seem to be a kind of prefetching. I can't really think of a logical explanation, but again, the CPU makes use of a lot of undocumented heuristics and guesses to execute code as fast as possible, and sometimes that means weird corner cases where those heuristics fail and the code becomes slower than you'd expect.

Have you tested this on different CPU models? It would be interesting to see whether this happens only on your specific CPU or whether other x86 CPUs exhibit the same thing.

醉生梦死 2024-12-15 09:55:10

bob.s

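.data
variable:
    # 24 .word values = 48 bytes, enough for the 32-bit counters at offsets 0, 8, 16 and 24
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0

.text
.globl runAssemblyCode
runAssemblyCode:
  mov    $0xFFFFFFFF,%eax        # loop counter

start_loop:
  decl variable+0
  decl variable+8
  decl variable+16
  # decl variable+24             # the extra instruction; GNU as on x86 treats ';' as a
                                 # statement separator, so '#' is needed to comment it out
  dec    %eax
  jne    start_loop
  retq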

ted.c

#include <stdio.h>
#include <time.h>

/* Implemented in bob.s. */
void runAssemblyCode ( void );

int main ( void )
{
    volatile unsigned int ra,rb;

    /* time() has one-second resolution, so the result is in whole seconds. */
    ra=(unsigned int)time(NULL);
    runAssemblyCode();
    rb=(unsigned int)time(NULL);
    printf("%u\n",rb-ra);
    return(0);
}

gcc -O2 ted.c bob.s -o ted
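
Because time() only counts whole seconds, a finer timer makes small differences between the two variants easier to see. The following is a minimal sketch, assuming a POSIX system where clock_gettime and CLOCK_MONOTONIC are available; elapsed_seconds and busy_loop are illustrative names that do not come from the original post, and busy_loop is only a stand-in for runAssemblyCode so that the sketch builds on its own.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Run fn once and return the elapsed wall-clock time in seconds. */
double elapsed_seconds(void (*fn)(void))
{
    struct timespec t0, t1;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Combine the whole seconds and the nanosecond remainder. */
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Hypothetical stand-in for runAssemblyCode (assumption, not the real workload). */
void busy_loop(void)
{
    volatile unsigned long counter = 10UL * 1000UL * 1000UL;

    while (counter != 0)
        counter--;
}
```

In ted.c, the two time() calls could then be replaced by something like printf("%.3f\n", elapsed_seconds(runAssemblyCode)).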

This was with the extra instruction:

00000000004005d4 <runAssemblyCode>:
  4005d4:   b8 ff ff ff ff          mov    $0xffffffff,%eax

00000000004005d9 <start_loop>:
  4005d9:   ff 0c 25 28 10 60 00    decl   0x601028
  4005e0:   ff 0c 25 30 10 60 00    decl   0x601030
  4005e7:   ff 0c 25 38 10 60 00    decl   0x601038
  4005ee:   ff 0c 25 40 10 60 00    decl   0x601040 
  4005f5:   ff c8                   dec    %eax
  4005f7:   75 e0                   jne    4005d9 <start_loop>
  4005f9:   c3                      retq   
  4005fa:   90                      nop

I don't see a difference; maybe you can correct my code, or others can try it on their systems to see what they get.

That is an extremely painful instruction, and if you are doing anything other than byte-sized memory decrements, the access is unaligned, which is painful for the memory system. So this routine should be sensitive to cache lines as well as to the number of cores, etc.
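
Whether these accesses are aligned, and which cache line each one falls in, can be checked from the addresses in the disassembly above. The following is a minimal sketch, assuming a 64-byte cache line (common on x86, but not stated anywhere in this thread); the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Nonzero if an access of `size` bytes at `addr` is naturally aligned.
   `size` must be a power of two. */
int is_aligned(uintptr_t addr, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* Index of the cache line that contains `addr`. */
uintptr_t cache_line_index(uintptr_t addr, uintptr_t line_size)
{
    assert(line_size != 0);
    return addr / line_size;
}

/* Print alignment and cache line for the four decl targets in the listing. */
void describe_accesses(void)
{
    /* Addresses taken from the objdump output of this particular build. */
    const uintptr_t targets[4] = { 0x601028, 0x601030, 0x601038, 0x601040 };
    const uintptr_t line_size = 64;  /* assumption: 64-byte cache lines */

    for (int i = 0; i < 4; i++) {
        printf("%#lx: 4-byte aligned=%d, cache line=%#lx\n",
               (unsigned long)targets[i],
               is_aligned(targets[i], 4),
               (unsigned long)cache_line_index(targets[i], line_size));
    }
}
```

For the addresses in this listing, each 4-byte decl target is 4-byte aligned, and under the 64-byte assumption the fourth target (0x601040) lies in the cache line after the one that holds the first three. Other builds may place variable elsewhere.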

It took about 13 seconds with or without the extra instruction on an AMD Phenom 9950 quad-core processor.

On an Intel(R) Core(TM)2 CPU 6300 it took about 9-10 seconds with or without the extra instruction.

On a two-processor Intel(R) Xeon(TM) CPU machine it took about 13 seconds with or without the extra instruction.

On an Intel(R) Core(TM)2 Duo CPU T7500: 8 seconds with or without.

All are running Ubuntu 64 bit 10.04 or 10.10; there might be an 11.04 in there.

Some more machines, 64 bit, Ubuntu:

Intel(R) Xeon(R) CPU X5450 (8 core): 6 seconds with or without the extra instruction.

Intel(R) Xeon(R) CPU E5405 (8 core): 9 seconds with or without.

What is the speed of the DDR/DRAM in your system? What kind of processor are you running (cat /proc/cpuinfo if on Linux)?

Intel(R) Xeon(R) CPU E5440 (8 core): 6 seconds with or without.

Ahh, found a single core, a Xeon though: Intel(R) Xeon(TM) CPU, 15 seconds with or without the extra instruction.

旧时模样 2024-12-15 09:55:10

It's not that bad. On average, one iteration of the shorter loop (without the extra dec) takes about 2.6 ns (6 s / 2,330,000,000), while one iteration of the complete loop takes about 1.9 ns (4.5 s / 2,330,000,000). Assuming a 2 GHz CPU, which has a period of 0.5 ns, the difference is (2.6 - 1.9) / 0.5 = 1.4, that is, roughly one clock cycle per iteration, nothing surprising.
The time difference becomes so noticeable, though, because of the number of iterations you requested: 0.5 ns * 2330000000 = 1.2 seconds, which is close to the difference you observed.
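
The per-iteration figures can be reproduced with a short calculation. This is a minimal sketch that uses the timings and loop count from the question; the 2 GHz clock is the answer's assumption rather than a measured value, and the function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Average time per loop iteration, in nanoseconds. */
double ns_per_iteration(double total_seconds, double iterations)
{
    assert(iterations > 0.0);
    return total_seconds * 1e9 / iterations;
}

/* Convert a duration in nanoseconds to clock cycles at `clock_ghz` GHz. */
double ns_to_cycles(double ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return ns * clock_ghz;  /* 1 GHz corresponds to 1 cycle per nanosecond */
}

/* Print the per-iteration figures for the two timings in the question. */
void report(void)
{
    const double iterations = 2330.0 * 1000.0 * 1000.0; /* loop count from the question */
    const double clock_ghz = 2.0;                       /* assumption: clock speed is not stated */
    const double slow_ns = ns_per_iteration(6.0, iterations); /* without the extra dec */
    const double fast_ns = ns_per_iteration(4.5, iterations); /* with the extra dec */

    printf("without extra dec: %.2f ns per iteration\n", slow_ns);
    printf("with extra dec:    %.2f ns per iteration\n", fast_ns);
    printf("difference:        %.2f cycles per iteration\n",
           ns_to_cycles(slow_ns - fast_ns, clock_ghz));
}
```

Without rounding the intermediate values, the difference comes out slightly below the 1.4 cycles obtained from the rounded 2.6 ns and 1.9 ns figures, but it is still on the order of one cycle per iteration.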
