Does the restrict keyword provide significant benefits in gcc/g++?

Posted 2024-08-16 07:16:37

Has anyone ever seen any numbers/analysis on whether or not use of the C/C++ restrict keyword in gcc/g++ actually provides any significant performance boost in reality (and not just in theory)?

I've read various articles recommending / disparaging its use, but I haven't run across any real numbers practically demonstrating either side's arguments.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers and I've read a paper by Christer Ericson which strongly recommends its usage.

Comments (5)

坚持沉默 2024-08-23 07:16:37

The restrict keyword does make a difference.

I've seen improvements of a factor of 2 or more in some situations (image processing). Most of the time the difference is not that large though: about 10%.

Here is a little example that illustrates the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code, and restrict wouldn't make a difference due to inlining.

I could have moved the transform function to a different file as well.

#include <math.h>

/* Without -DUSE_RESTRICT, define __restrict away so the qualifier has no effect. */
#ifndef USE_RESTRICT
#define __restrict
#endif


void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n) __attribute__ ((noinline));

void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n)
{
  int i;

  // simple transform loop.

  // written with aliasing in mind. dest, src and matrix 
  // are potentially aliasing, so the compiler is forced to reload
  // the values of matrix and src for each iteration.

  for (i=0; i<n; i++)
  {
    dest[0] = src[0] * matrix[0] + src[1] * matrix[1] + 
              src[2] * matrix[2] + src[3] * matrix[3];

    dest[1] = src[0] * matrix[4] + src[1] * matrix[5] + 
              src[2] * matrix[6] + src[3] * matrix[7];

    dest[2] = src[0] * matrix[8] + src[1] * matrix[9] + 
              src[2] * matrix[10] + src[3] * matrix[11];

    dest[3] = src[0] * matrix[12] + src[1] * matrix[13] + 
              src[2] * matrix[14] + src[3] * matrix[15];

    src  += 4;
    dest += 4;
  }
}

float srcdata[4*10000];
float dstdata[4*10000];

int main (int argc, char**args)
{
  int i,j;
  float matrix[16];

  // init all source-data, so we don't get NANs  
  for (i=0; i<16; i++)   matrix[i] = 1;
  for (i=0; i<4*10000; i++) srcdata[i] = i;

  // do a bunch of tests for benchmarking. 
  for (j=0; j<10000; j++)
    transform (dstdata, srcdata, matrix, 10000);
}

Results (on my 2 GHz Core Duo):

nils@doofnase:~$ gcc -O3 test.c
nils@doofnase:~$ time ./a.out

real    0m2.517s
user    0m2.516s
sys     0m0.004s

nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c
nils@doofnase:~$ time ./a.out

real    0m2.034s
user    0m2.028s
sys     0m0.000s

As a rule of thumb, that is about 20% faster execution on that system.

To show how much it depends on the architecture, I've let the same code run on a Cortex-A8 embedded CPU (I adjusted the loop count a bit because I didn't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
root@beagleboard:~# time ./a.out

real    0m 7.64s
user    0m 7.62s
sys     0m 0.00s

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c 
root@beagleboard:~# time ./a.out

real    0m 7.00s
user    0m 6.98s
sys     0m 0.00s

Here the difference is just 9% (same compiler btw.)
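
The percentages quoted above follow from the `real` timings. A minimal sketch of that arithmetic, where the helper names `time_saved_percent` and `print_savings` are hypothetical and introduced only for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: percentage of run time saved when a run
   goes from t_before seconds to t_after seconds. */
double time_saved_percent(double t_before, double t_after)
{
    return 100.0 * (t_before - t_after) / t_before;
}

/* Prints the savings for the two pairs of `real` timings quoted above. */
void print_savings(void)
{
    printf("Core Duo:  %.1f%%\n", time_saved_percent(2.517, 2.034));
    printf("Cortex-A8: %.1f%%\n", time_saved_percent(7.64, 7.00));
}
```

Measured this way, run time drops by roughly 19% on the Core Duo and roughly 8% on the Cortex-A8. Expressed instead as a ratio of old time to new time, the figures come out somewhat higher, which is likely where the rounded 20% and 9% come from.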

帅冕 2024-08-23 07:16:37

Does the restrict keyword provide significant benefits in gcc / g++ ?

It can reduce the number of instructions, as shown in the example below, so use it whenever possible.

GCC 4.8 Linux x86-64 example

Input:

void f(int *a, int *b, int *x) {
  *a += *x;
  *b += *x;
}

void fr(int *restrict a, int *restrict b, int *restrict x) {
  *a += *x;
  *b += *x;
}

Compile and disassemble:

gcc -g -std=c99 -O0 -c main.c
objdump -S main.o

With -O0, they are the same.

With -O3:

void f(int *a, int *b, int *x) {
    *a += *x;
   0:   8b 02                   mov    (%rdx),%eax
   2:   01 07                   add    %eax,(%rdi)
    *b += *x;
   4:   8b 02                   mov    (%rdx),%eax
   6:   01 06                   add    %eax,(%rsi)  

void fr(int *restrict a, int *restrict b, int *restrict x) {
    *a += *x;
  10:   8b 02                   mov    (%rdx),%eax
  12:   01 07                   add    %eax,(%rdi)
    *b += *x;
  14:   01 06                   add    %eax,(%rsi) 

For the uninitiated, the calling convention is:

  • rdi = first parameter
  • rsi = second parameter
  • rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why was GCC able to optimize that?

The code above was taken from the Wikipedia example which is very illuminating.

Pseudo assembly for f:

load R1 ← *x    ; Load the value of x pointer
load R2 ← *a    ; Load the value of a pointer
add R2 += R1    ; Perform Addition
set R2 → *a     ; Update the value of a pointer
; Similarly for b, note that x is loaded twice,
; because x may point to a (a aliased by x) thus 
; the value of x will change when the value of a
; changes.
load R1 ← *x
load R2 ← *b
add R2 += R1
set R2 → *b

For fr:

load R1 ← *x
load R2 ← *a
add R2 += R1
set R2 → *a
; Note that x is not reloaded,
; because the compiler knows it is unchanged
; "load R1 ← *x" is no longer needed.
load R2 ← *b
add R2 += R1
set R2 → *b
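
To see why the reload in f is required and not just a missed optimization, here is a small sketch in which x really does alias a. The function `aliased_result` is a hypothetical name used only for this illustration; f is the same as above.

```c
/* Same function as f above: no restrict, so aliasing is allowed. */
void f(int *a, int *b, int *x)
{
    *a += *x;
    *b += *x;
}

/* Hypothetical demo: x points at the same int as a. */
int aliased_result(void)
{
    int a = 1, b = 10;
    f(&a, &b, &a);  /* *a becomes 2, so the second read of *x must see 2 */
    return b;       /* 10 + 2 = 12; reusing the stale value 1 would give 11 */
}
```

Calling fr with the same arguments would be undefined behavior, because restrict is a promise from the caller that the pointers do not alias. The compiler does not check it.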

Is it really any faster?

Ermmm... not for this simple test:

.text
    .global _start
    _start:
        mov $0x10000000, %rbx
        mov $x, %rdx
        mov $x, %rdi
        mov $x, %rsi
    loop:
        # START of interesting block
        mov (%rdx),%eax
        add %eax,(%rdi)
        mov (%rdx),%eax # Comment out this line.
        add %eax,(%rsi)
        # END ------------------------
        dec %rbx
        cmp $0, %rbx
        jnz loop
        mov $60, %rax
        mov $0, %rdi
        syscall
.data
    x:
        .int 0

And then:

as -o a.o a.S && ld a.o && time ./a.out

on Ubuntu 14.04 AMD64, with an Intel i5-3210M CPU.

I confess that I still don't understand modern CPUs. Let me know if you:

  • found a flaw in my method
  • found an assembler test case where it becomes much faster
  • understand why there wasn't a difference
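
One possible way to repeat the comparison from C instead of hand-written assembly is sketched below. It assumes GCC (for the `noinline` attribute) and a C99 or later mode (for `restrict`). The names `add_fn` and `seconds_for` are hypothetical. Unlike the assembly loop, where all three registers point at the same x, the caller is expected to pass three distinct objects, which is what restrict requires.

```c
#include <time.h>

/* noinline keeps GCC from inlining the calls and proving the
   absence of aliasing on its own. */
__attribute__((noinline))
void f(int *a, int *b, int *x)
{
    *a += *x;
    *b += *x;
}

__attribute__((noinline))
void fr(int *restrict a, int *restrict b, int *restrict x)
{
    *a += *x;
    *b += *x;
}

/* Hypothetical helper type: both f and fr are compatible with it. */
typedef void (*add_fn)(int *, int *, int *);

/* Hypothetical helper: CPU seconds spent calling fn `iterations` times.
   a, b and x must point to three different ints when fn is fr. */
double seconds_for(add_fn fn, long iterations, int *a, int *b, int *x)
{
    clock_t start = clock();
    long i;
    for (i = 0; i < iterations; i++)
        fn(a, b, x);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Whether any difference shows up will depend on the CPU and on how much of the loop time is call overhead, so treat this as a starting point for measuring, not as a result. Keep `iterations` times `*x` small enough that the int sums do not overflow.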

ζ澈沫 2024-08-23 07:16:37

The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf) which says it generally doesn't help and provides measurements to back this up.

飘逸的'云 2024-08-23 07:16:37

Note that C++ compilers that allow the restrict keyword may still ignore it. That is the case for example here.
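
Since restrict is a C99 keyword and only a compiler extension in C++, code shared between the two languages often hides the spelling behind a macro. A minimal sketch, assuming the `__restrict` spelling accepted by GCC, Clang and MSVC in C++ mode. The macro name `RESTRICT` and the function `scale` are hypothetical.

```c
/* Hypothetical macro: pick a spelling the current compiler accepts,
   or expand to nothing if none is known. */
#if defined(__cplusplus)
#  if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#    define RESTRICT __restrict
#  else
#    define RESTRICT
#  endif
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define RESTRICT restrict
#else
#  define RESTRICT
#endif

/* Hypothetical example function: dst and src must not overlap. */
void scale(float *RESTRICT dst, const float *RESTRICT src, float k, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

Even where the qualifier is accepted, the compiler is free to make no use of it, so the only way to know whether it helped is to inspect the generated code or measure.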

蝶舞 2024-08-23 07:16:37

I tested this C program. Without restrict it took 12.640 seconds to complete; with restrict, 12.516 seconds. It looks like it can save some time.
