How many instructions does it take to access a pointer in C?

Posted on 2024-08-30 13:46:39

I am trying to figure out how many clock cycles or total instructions it takes to access a pointer in C. I don't think I know how to work this out for, say, p->x = d->a + f->b

I would assume two loads per pointer, guessing that there is one load for the pointer and one load for the value. So in this operation, the pointer resolution would be a much larger factor than the actual addition, as far as trying to speed this code up, right?

This may depend on the compiler and architecture implemented, but am I on the right track?

I have seen some code where each value used in, say, 3 additions came from a

 f2->sum = p1->p2->p3->x + p1->p2->p3->a + p1->p2->p3->m

type of structure, and I am trying to work out how bad this is.

Comments (4)

不乱于心 2024-09-06 13:46:40

This depends on the architecture at hand.

Some architectures can use a memory operand directly in an instruction without first loading it into a register; others can't. Some architectures have no instructions that compute the offset for you as part of the dereference, so you must load the memory address, add your offset to it, and only then dereference the resulting location. I'm sure there are more variations from chip to chip.
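
To make the "load the address, add the offset, then dereference" sequence concrete, here is a minimal sketch in C that spells out by hand what a field access such as p->y amounts to. The struct point type and both function names are hypothetical, chosen only for illustration; how many machine instructions each step becomes still depends on the compiler and the target.

```c
#include <stddef.h>  /* offsetof */
#include <string.h>  /* memcpy */

/* Hypothetical struct, used only for illustration. */
struct point {
    int x;
    int y;
};

/* p->y spelled out by hand: take the base address held in the pointer,
   add the field's offset, then perform one load from that address. */
int load_y_by_hand(const struct point *p)
{
    const unsigned char *base = (const unsigned char *)p;
    int value;

    /* memcpy stands in for the single load of the field. */
    memcpy(&value, base + offsetof(struct point, y), sizeof value);
    return value;
}

/* The ordinary spelling; it returns the same value. */
int load_y(const struct point *p)
{
    return p->y;
}
```

On a target with base-plus-offset addressing, both functions will likely compile to a single load; on a target without it, the add becomes a separate instruction.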

Once you get past these, each instruction also takes a varying amount of time depending on the architecture. To be honest though, that overhead is very, very small.

For your immediate question about dereferencing a chain of items, the slowness comes from the fact that locality of reference is likely to get worse the farther you go along the chain. That means more cache misses, which means more trips to main memory (or disk!) to get the data. Main memory is very slow compared to the CPU.

紫南 2024-09-06 13:46:40

Some IDEs, such as Visual Studio, let you view the generated assembly alongside the source code.

How to view the assembly behind the code using Visual C++?

Then you can see what it looks like for your exact architecture and implementation.

If you are using GDB (Linux, Mac), use the disassemble command:

(gdb) disas 0x32c4 0x32e4
Dump of assembler code from 0x32c4 to 0x32e4:
0x32c4 <main+204>:      addil 0,dp
0x32c8 <main+208>:      ldw 0x22c(sr0,r1),r26
0x32cc <main+212>:      ldil 0x3000,r31
0x32d0 <main+216>:      ble 0x3f8(sr4,r31)
0x32d4 <main+220>:      ldo 0(r31),rp
0x32d8 <main+224>:      addil -0x800,dp
0x32dc <main+228>:      ldo 0x588(r1),r26
0x32e0 <main+232>:      ldil 0x3000,r31
End of assembler dump.
林空鹿饮溪 2024-09-06 13:46:40

It depends on what you are doing. A trivial pointer dereference such as y = *z; where

int x = 1;
int* z = &x;
int y;

might assemble to something like this on the x86:

mov eax, [z]
mov eax, [eax]
mov [y], eax

and y = x would still take a memory access:

mov eax, [x]
mov [y], eax

Mov instructions that access memory take about 2-4 cycles when the data is already in cache, IIRC.

However, if you are loading memory from completely random locations, you will cause a lot of cache misses (and, in the worst case, page faults), each of which can waste hundreds of clock cycles or more.

时光病人 2024-09-06 13:46:40

Where it can, the compiler will remove that overhead for you by keeping repeatedly used base addresses in a register (e.g. p1->p2->p3 in your example).

However, sometimes the compiler can't determine which pointers might alias other pointers used within your function - which means that it has to fall back to a very conservative position, and reload values from pointers frequently.

This is where C99's restrict keyword can help. It lets you tell the compiler that certain pointers are never aliased by other pointers within the scope of the function, which consequently can improve the optimisation.
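
As a minimal sketch of how restrict might be applied (the function scale_all and its parameters are hypothetical, not taken from the question), consider a loop that reads a scale factor through one pointer and writes through another. Without the qualifiers, the compiler has to assume that a store to out[i] could change *scale and may reload it on every iteration; with them, it is free to keep the value in a register. Whether it actually does so depends on the compiler and optimisation level.

```c
/* Hypothetical example. The restrict qualifiers are a promise by the
   caller that out, in and scale do not overlap. */
void scale_all(int *restrict out, const int *restrict in,
               const int *restrict scale, int n)
{
    for (int i = 0; i < n; i++) {
        /* *scale cannot be modified by the store to out[i],
           so it need not be reloaded each time round the loop. */
        out[i] = in[i] * *scale;
    }
}
```

Note that the promise is not checked: calling this with overlapping arguments is undefined behaviour.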


For example, take this function:

struct xyz {
    int val1;
    int val2;
    int val3;
};

struct abc {
    struct xyz *p2;
};

int foo(struct abc *p1)
{
    int sum;

    sum = p1->p2->val1 + p1->p2->val2 + p1->p2->val3;

    return sum;
}

Under gcc 4.3.2 with optimisation level -O1, it compiles to this x86 code:

foo:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %eax
    movl    (%eax), %edx
    movl    4(%edx), %eax
    addl    (%edx), %eax
    addl    8(%edx), %eax
    popl    %ebp
    ret

As you can see, it only dereferences p1 once: it keeps the value of p1->p2 in the %edx register and uses it three times to fetch the three values from that structure.
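
If you would rather not rely on the optimiser, for instance because stores through other pointers sit between the reads, the same hoisting can be written by hand. The sketch below applies it to the expression from the question; all the struct definitions and the name sum_fields are assumptions made for illustration, since the question does not show the real types.

```c
/* Hypothetical types mirroring the question's p1->p2->p3 chain. */
struct leaf   { int x, a, m; };
struct mid    { struct leaf *p3; };
struct top    { struct mid *p2; };
struct result { int sum; };

void sum_fields(struct result *f2, const struct top *p1)
{
    /* Walk the chain once and keep the final pointer in a local. */
    const struct leaf *l = p1->p2->p3;

    /* Three field loads from one base address, no repeated chain walks. */
    f2->sum = l->x + l->a + l->m;
}
```

This also makes the intent explicit to the reader: the chain is assumed not to change while the sum is computed.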
