Accessing array values via pointer arithmetic vs. subscripting in C

Posted on 2024-07-07 15:48:10

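I keep reading that in C, using pointer arithmetic is generally faster than subscripting for array access. Is this true even with modern (supposedly optimizing) compilers?

If so, is this still the case as I begin to move away from learning C into Objective-C and Cocoa?

Which is the preferred coding style for array access, in both C and Objective-C? Which one is considered (by professionals of the respective languages) more legible and more "correct" (for lack of a better term)?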

江南烟雨〆相思醉 2024-07-14 15:48:11

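"using pointer arithmetic is generally faster than subscripting for array access"

It is the same operation either way. Subscripting is syntactic sugar for adding (element size * index) to the array's start address.

That said, when iterating over the elements of an array, taking a pointer to the first element and incrementing it on each pass through the loop will usually be slightly faster than computing the current element's position from the loop variable each time. (It rarely matters much in a real-world application, though. Examine your algorithm first; premature optimization is the root of all evil, and so on.)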
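To make the two loop styles described above concrete, here is a minimal sketch in C. The function names sum_indexed and sum_pointer are made up for this illustration; both compute the same result, and whether either one is faster depends on the compiler and the CPU.

```c
#include <stddef.h>

/* Subscript style: a[i] is defined by the language as *(a + i). */
long sum_indexed(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Pointer style: walk from the first element to one past the last. */
long sum_pointer(const int *a, size_t n) {
    long sum = 0;
    for (const int *p = a, *end = a + n; p != end; p++) {
        sum += *p;
    }
    return sum;
}
```

Calling either function on the same array returns the same sum, so the choice between them can be made on readability alone unless a measurement says otherwise.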

后知后觉 2024-07-14 15:48:11

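This might be a bit off topic (sorry), because it doesn't answer your question about execution speed, but you should keep in mind that premature optimization is the root of all evil (Knuth). In my opinion, especially while you are still (re)learning the language, by all means write the code in the way that is easiest to read first.
Then, once your program runs correctly, consider optimizing for speed.
Most of the time your code will be fast enough anyway.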

挽心 2024-07-14 15:48:11

Mecki has a great explanation. From my experience, one of the things that often matters with indexing vs. pointers is what other code sits in the loop. Example:

#include <stdint.h>   // int64_t
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>

using namespace std;

typedef int64_t int64;
static int64 nsTime() {
  struct timespec tp;
  clock_gettime(CLOCK_REALTIME, &tp);
  return tp.tv_sec*(int64)1000000000 + tp.tv_nsec;
}

typedef int T;
size_t const N = 1024*1024*128;
T data[N];

int main(int, char**) {
  cout << "starting\n";

  {
    int64 const a = nsTime();
    int sum = 0;
    for (size_t i=0; i<N; i++) {
      sum += data[i];
    }
    int64 const b = nsTime();
    cout << "Simple loop (indexed): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    T *d = data;
    for (size_t i=0; i<N; i++) {
      sum += *d++;
    }
    int64 const b = nsTime();
    cout << "Simple loop (pointer): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    for (size_t i=0; i<N; i++) {
      int a = sum+3;
      int b = 4-sum;
      int c = sum+5;
      sum += data[i] + a - b + c;
    }
    int64 const b = nsTime();
    cout << "Loop that uses more ALUs (indexed): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    T *d = data;
    for (size_t i=0; i<N; i++) {
      int a = sum+3;
      int b = 4-sum;
      int c = sum+5;
      sum += *d++ + a - b + c;
    }
    int64 const b = nsTime();
    cout << "Loop that uses more ALUs (pointer): " << (b-a)/1e9 << "\n";
  }
}

On a fast Core 2-based system (g++ 4.1.2, x64), here's the timing:

    Simple loop (indexed): 0.400842
    Simple loop (pointer): 0.380633
    Loop that uses more ALUs (indexed): 0.768398
    Loop that uses more ALUs (pointer): 0.777886

Sometimes indexing is faster, sometimes pointer arithmetic is. It depends on how the CPU and the compiler are able to pipeline the loop execution.

爱情眠于流年 2024-07-14 15:48:11

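If you're dealing with array-type data, I'd say that using subscripts makes the code more readable. On today's machines (especially for something as simple as this), readable code matters more.

Now, if you're dealing explicitly with a chunk of data you malloc()'d and you want a pointer into that data, say 20 bytes into an audio file header, then I think address arithmetic expresses more clearly what you're trying to do.

~~I'm not sure about compiler optimizations in this regard, but even if subscripting is slower, it is slower by a few clock cycles at most. That is hardly anything when you can gain so much more from the clarity of your train of thought.~~

Edit: according to some of the other responses, subscripting is just a syntactic element and has no effect on performance, as I figured. In that case, definitely go with whichever form best expresses the context in which you access the data inside the block the pointer points to.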
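As a sketch of the "20 bytes into a header" case mentioned above: the layout here is invented for illustration (a 44-byte header with a 32-bit field at byte offset 20, stored in the machine's native byte order), and the names FIELD_OFFSET, HEADER_SIZE, make_header and read_field are hypothetical, not taken from any real audio format.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a 32-bit field sits 20 bytes into a 44-byte header. */
#define FIELD_OFFSET 20
#define HEADER_SIZE  44

/* Allocate a zeroed header and store `value` at the field offset.
   Returns NULL if the allocation fails; the caller frees the buffer. */
unsigned char *make_header(uint32_t value) {
    unsigned char *header = calloc(HEADER_SIZE, 1);
    if (header != NULL) {
        memcpy(header + FIELD_OFFSET, &value, sizeof value);
    }
    return header;
}

/* Read the field back. `header + FIELD_OFFSET` says "20 bytes in" directly;
   memcpy is used so the read does not depend on alignment. */
uint32_t read_field(const unsigned char *header) {
    uint32_t value;
    memcpy(&value, header + FIELD_OFFSET, sizeof value);
    return value;
}
```

Here the offset is a property of the data format rather than an element index, which is arguably why the address-arithmetic form reads more naturally than a subscript.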

云淡月浅 2024-07-14 15:48:11

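Keep in mind that execution speed is hard to predict even when you look at the machine code, given superscalar CPUs and the like with

out-of-order execution
pipelining
branch prediction
hyperthreading
...

It is not just a matter of counting machine instructions, and not even just a matter of counting clock cycles.
It seems easier to simply measure in the cases where it is really necessary. Calculating the correct cycle count for a given program is not impossible (we had to do it at university), but it is hardly fun and it is hard to get right.
Side note: measuring correctly is also hard in multi-threaded / multi-processor environments.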
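For the "just measure it" advice above, a rough sketch of a timing helper in standard C follows. The names time_cpu_seconds and sum_array are made up for this example, and clock() only reports processor time at coarse resolution, so treat the numbers as indicative at best.

```c
#include <stddef.h>
#include <time.h>

/* Example workload to time (any function with this signature will do). */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Hypothetical helper: call fn(data, n) `reps` times and return the CPU
   time used, in seconds. Results are accumulated into a volatile sink so
   the compiler cannot discard the calls as dead code. */
double time_cpu_seconds(long (*fn)(const int *, size_t),
                        const int *data, size_t n, int reps,
                        volatile long *sink) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        *sink += fn(data, n);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running each candidate several times on realistic data sizes and comparing the results is usually more informative than a single run, for the reasons listed above.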

莫多说 2024-07-14 15:48:11

char p1[] = "12345";
const char *p2 = "12345";

char ch = p1[3];  /* '4' */
ch = *(p2 + 3);   /* '4' */

The C standard doesn't say which is faster. The observable behavior is the same, and it is up to the compiler to implement it in any way it wants. More often than not it won't even read memory at all.

In general, you have no way to say which is "faster" unless you specify a compiler, version, architecture, and compile options. Even then, optimization will depend on the surrounding context.

So the general advice is to use whatever gives clearer and simpler code. Using array[ i ] gives some tools the ability to try to find index-out-of-bounds conditions, so if you are using arrays, it's better to just treat them as such.

If it is critical, look at the assembly that your compiler generates. But keep in mind that it may change as you change the code that surrounds it.

<逆流佳人身旁 2024-07-14 15:48:11

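No, using pointer arithmetic is not faster, and it is most probably slower, because an optimizing compiler may use instructions like LEA (Load Effective Address) on Intel processors, or similar instructions on other processors, to perform the pointer arithmetic, which is faster than add or add/mul. LEA has the advantage of doing several things at once without affecting the flags, and it also takes one cycle to compute. By the way, the text at the end of this answer is from the GCC manual, so -Os does not optimize primarily for speed.

I also completely agree with themarko. First try to write clean, readable and reusable code, and only then think about optimization and use profiling tools to find the bottleneck. Most of the time the performance problem is I/O related, or a bad algorithm, or some bug that you have to hunt down. Knuth is the man ;-)

It just occurred to me: what would you do with an array of structs? If you wanted to use pointer arithmetic, you would definitely have to do it for each member of the struct. Does that sound like overkill? Yes, of course it is overkill, and it also opens the door wide to obscure bugs.

-Os: Optimize for size. -Os enables all -O2 optimizations that typically do not increase code size. It also performs further optimizations designed to reduce code size.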
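To illustrate the struct-array remark above, here is a small sketch that walks an array of structs with subscripts and with a single struct pointer. The type struct point and both function names are hypothetical, and the per-member pointer variant the answer calls overkill is deliberately not shown.

```c
#include <stddef.h>

/* Hypothetical record type, used only for this illustration. */
struct point {
    int x;
    int y;
};

/* Subscript style: pts[i].x names the element and the member in one place. */
long sum_points_indexed(const struct point *pts, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += pts[i].x + pts[i].y;
    }
    return total;
}

/* Pointer style: p++ advances by sizeof(struct point), and members are
   reached through ->. */
long sum_points_pointer(const struct point *pts, size_t n) {
    long total = 0;
    for (const struct point *p = pts, *end = pts + n; p != end; p++) {
        total += p->x + p->y;
    }
    return total;
}
```

Both versions touch the same memory in the same order; keeping a separate pointer per member would add bookkeeping without changing what is read.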

旧时光的容颜 2024-07-14 15:48:11

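That's not true. It is exactly as fast as the subscript operator. In Objective-C you can use arrays both C-style and in an object-oriented style; the object-oriented style is a lot slower, because the dynamic nature of method calls adds some work to every call.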

旧瑾黎汐 2024-07-14 15:48:11

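It's unlikely that there will be any difference in speed.

Using the array operator [] is probably preferred, because in C++ you can use the same syntax with other containers (e.g. vector).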

も让我眼熟你 2024-07-14 15:48:11

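I have worked on C++/assembly optimization for several AAA titles for 10 years, and I can say that on the particular platforms/compilers I have worked with, pointer arithmetic made a quite measurable difference.

As an example to put things in perspective: I was able to make a really tight loop in our particle generator 40% faster by replacing all array accesses with pointer arithmetic, to the complete disbelief of my coworkers. I had heard of it from one of my teachers as a good trick back in the day, but I assumed there was no way it would make a difference with the compilers/CPUs we have today. I was wrong ;)

It must be pointed out that many of the console ARM processors don't have all the cute features of modern CISC CPUs, and the compiler was a bit shaky at times.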

听不够的曲调 2024-07-14 15:48:10

You need to understand the reason behind this claim. Have you ever asked yourself why it would be faster? Let's compare some code:

#include <stdio.h>
#include <string.h>

int main(void) {
    int i;
    int a[20];

    // Init all values to zero
    memset(a, 0, sizeof(a));
    for (i = 0; i < 20; i++) {
        printf("Value of %d is %d\n", i, a[i]);
    }
    return 0;
}

They are all zero, what a surprise :-P The question is, what does a[i] actually mean in low-level machine code? It means:

1. Take the address of a in memory.
2. Add i times the size of a single item of a to that address (an int is usually four bytes).
3. Fetch the value from that address.

So each time you fetch a value from a, the base address of a is added to the result of multiplying i by four. If you just dereference a pointer, steps 1 and 2 don't need to be performed, only step 3.
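The three steps can be spelled out by hand in C. This is only a sketch of what a subscript means, with made-up function names; a compiler is free to generate something quite different.

```c
#include <stddef.h>

/* Steps 1 and 2: take the base address and add i times the element size. */
const int *element_address(const int *a, size_t i) {
    const unsigned char *base = (const unsigned char *)a;  /* step 1 */
    const unsigned char *addr = base + i * sizeof(int);    /* step 2 */
    return (const int *)addr;
}

/* Step 3: fetch the value stored at that address. */
int element_value(const int *a, size_t i) {
    return *element_address(a, i);
}
```

For any valid index, element_address(a, i) is the same address as &a[i], and element_value(a, i) is the same value as a[i].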

Consider the code below.

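#include <stdio.h>
#include <string.h>

int main(void) {
    int i;
    int a[20];
    int * b;

    memset(a, 0, sizeof(a));
    b = a;  // b points at the first element of a
    for (i = 0; i < 20; i++) {
        printf("Value of %d is %d\n", i, *b);
        b++;  // advance to the next int
    }
    return 0;
}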

This code might be faster... but even if it is, the difference is tiny. Why might it be faster? "*b" is the same as step 3 above. However, "b++" is not the same as steps 1 and 2. "b++" will increase the pointer by 4.

(Important for newbies: running ++ on a pointer will not advance the pointer by one byte in memory! It advances the pointer by as many bytes as the data it points to occupies. Here it points to an int, and an int is four bytes on my machine, so b++ increases b by four!)
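A small sketch to check that claim on your own machine; the function name int_pointer_step is invented for this example. It measures, in bytes, how far one ++ moves an int pointer.

```c
#include <stddef.h>

/* Return the number of bytes that a single ++ moves an int pointer. */
size_t int_pointer_step(void) {
    int a[2] = {0, 0};
    int *b = a;
    int *next = b;
    next++;  /* next now points at a[1] */
    /* Compare the two addresses as byte (char) pointers. */
    return (size_t)((char *)next - (char *)b);
}
```

The result equals sizeof(int) on whatever platform the code is compiled for, which is four on the machine described above but is not guaranteed to be four everywhere.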

Okay, but why might it be faster? Because adding four to a pointer is faster than multiplying i by four and adding that to a pointer. You have an addition in either case, but in the second one you have no multiplication (you avoid the CPU time needed for one multiplication). Considering the speed of modern CPUs, though, I wonder whether you could really measure a difference in a benchmark, even if the array had one million elements.

Whether a modern compiler optimizes both versions to be equally fast is something you can check by looking at the assembly output it produces. You do so by passing the "-S" option (capital S) to GCC.

Here is the assembly for the first C program (optimization level -Os was used, which means optimize for code size and speed, but don't do speed optimizations that would noticeably increase code size, unlike -O2 and very much unlike -O3):

_main:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %edi
    pushl   %esi
    pushl   %ebx
    subl    $108, %esp
    call    ___i686.get_pc_thunk.bx
"L00000000001$pb":
    leal    -104(%ebp), %eax
    movl    $80, 8(%esp)
    movl    $0, 4(%esp)
    movl    %eax, (%esp)
    call    L_memset$stub
    xorl    %esi, %esi
    leal    LC0-"L00000000001$pb"(%ebx), %edi
L2:
    movl    -104(%ebp,%esi,4), %eax
    movl    %eax, 8(%esp)
    movl    %esi, 4(%esp)
    movl    %edi, (%esp)
    call    L_printf$stub
    addl    $1, %esi
    cmpl    $20, %esi
    jne L2
    addl    $108, %esp
    popl    %ebx
    popl    %esi
    popl    %edi
    popl    %ebp
    ret

And here is the same for the second program:

_main:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %edi
    pushl   %esi
    pushl   %ebx
    subl    $124, %esp
    call    ___i686.get_pc_thunk.bx
"L00000000001$pb":
    leal    -104(%ebp), %eax
    movl    %eax, -108(%ebp)
    movl    $80, 8(%esp)
    movl    $0, 4(%esp)
    movl    %eax, (%esp)
    call    L_memset$stub
    xorl    %esi, %esi
    leal    LC0-"L00000000001$pb"(%ebx), %edi
L2:
    movl    -108(%ebp), %edx
    movl    (%edx,%esi,4), %eax
    movl    %eax, 8(%esp)
    movl    %esi, 4(%esp)
    movl    %edi, (%esp)
    call    L_printf$stub
    addl    $1, %esi
    cmpl    $20, %esi
    jne L2
    addl    $124, %esp
    popl    %ebx
    popl    %esi
    popl    %edi
    popl    %ebp
    ret

Well, it's different, that's for sure. The differences in the stack numbers (108 vs. 124 bytes reserved, and the extra slot at -108(%ebp)) come from the variable b: in the first program there was one variable less on the stack, now we have one more, which changes the stack layout. The real code difference in the for loop is

movl    -104(%ebp,%esi,4), %eax

compared to

movl    -108(%ebp), %edx
movl    (%edx,%esi,4), %eax

Actually, to me it rather looks like the first approach is faster(!), since it issues a single machine instruction to perform all the work (the CPU does it all for us), instead of two. On the other hand, the two instructions of the second version might have a lower total runtime than the single one of the first.

As a closing word, I'd say that depending on your compiler and the CPU's capabilities (which instructions the CPU offers to access memory in which way), the result might go either way. Either one might be faster or slower. You cannot say for sure unless you limit yourself to exactly one compiler (meaning also one version) and one specific CPU. As CPUs can do more and more in a single instruction (ages ago, a compiler really had to fetch the address, multiply i by four and add both together before fetching the value), statements that used to be an absolute truth ages ago are more and more questionable nowadays. Also, who knows how CPUs work internally? Above I compare one assembly instruction to two others.

I can see that the number of instructions is different, and the time such an instruction needs can be different as well. How much memory these instructions need in their machine representation (they have to be transferred from memory to the CPU cache, after all) is also different. However, modern CPUs don't execute instructions the way you feed them. They split big instructions (often referred to as CISC) into small sub-instructions (often referred to as RISC), which also allows them to better optimize the program flow for speed internally. In fact, the single instruction of the first version and the two instructions of the second might result in the same set of sub-instructions, in which case there is no measurable speed difference whatsoever.

Regarding Objective-C, it is just C with extensions. So everything that holds true for C also holds true for Objective-C as far as pointers and arrays are concerned. If you use objects, on the other hand (for example, an NSArray or NSMutableArray), that is a completely different beast. In that case, however, you must access these arrays through methods anyway; there is no pointer/array access to choose from.
