Is a union more efficient than a shift on modern compilers?

Posted on 2024-11-09 20:30:41

Consider the simple code:

UINT64 result;
UINT32 high, low;
...
result = ((UINT64)high << 32) | (UINT64)low;

Do modern compilers turn that into a real barrel shift on high, or optimize it to a simple copy to the right location?

If not, then using a union would seem to be more efficient than the shift that most people appear to use. However, having the compiler optimize this is the ideal solution.

I'm wondering how I should advise people when they do require that extra little bit of performance.

Comments (4)

凹づ凸ル 2024-11-16 20:30:46

EDIT: This response is based on an earlier version of the OP's code that did not have a cast

This code

result = (high << 32) | low;

has undefined behavior: high is a 32-bit value, and shifting it by 32 bits (the full width of its type) is undefined in C, so the result depends on how the compiler and target platform handle the shift. OR'ing that result with low does not repair it, so the end result will most likely not be the 64-bit value you want. For instance, the code emitted by gcc -S on OSX 10.6 looks like:

movl    -4(%rbp), %eax      //retrieving the value of "high"
movl    $32, %ecx          
shal    %cl, %eax           //performing the 32-bit shift on "high"
orl    -8(%rbp), %eax       //OR'ing the value of "low" to the shift op result

So you can see that the shift takes place on a 32-bit value, in a 32-bit register, with a 32-bit instruction. The result ends up exactly the same as high | low with no shifting at all: x86 masks the count of a 32-bit shift to its low 5 bits, so in this case shal by 32 leaves the value in EAX unchanged. You're not getting a 64-bit result.

To avoid that, cast high to uint64_t before shifting:

result = ((uint64_t)high << 32) | low;
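
As a minimal sketch of the corrected expression, the helper below wraps it in a function. The name combine_u32 is hypothetical and not part of the original post; the point is only that the cast widens high to 64 bits before the shift, so a shift count of 32 is within range.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the post):
 * builds a 64-bit value from two 32-bit halves.
 * The cast widens `high` to 64 bits BEFORE the shift, so the shift
 * count (32) is smaller than the operand width and is well defined. */
uint64_t combine_u32(uint32_t high, uint32_t low)
{
    /* `low` is converted to uint64_t by the usual arithmetic conversions. */
    return ((uint64_t)high << 32) | low;
}
```

For example, combine_u32(1, 2) yields 0x0000000100000002.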

花开半夏魅人心 2024-11-16 20:30:45

Modern compilers are smarter than you might think ;-) (so yes, I think you can expect a barrel shift on any decent compiler).

Anyway, I would use the option whose semantics are closer to what you are actually trying to do.

来世叙缘 2024-11-16 20:30:45

If this is supposed to be platform independent, then the only option is to use shifts here.

With union { r64; struct{low;high}} you cannot tell which halves of r64 the low/high fields will map to. Think about endianness.

Modern compilers are pretty good at handling such shifts.
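
To illustrate the endianness point, here is a rough sketch of the union approach next to a runtime byte-order probe. The names halves_to_u64_union and is_little_endian are hypothetical, and the sketch assumes a plain little-endian or big-endian host (no mixed byte orders) with no padding inside the struct.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: combines two halves through a union.
 * `first` sits at the lowest address, so which half of `full` it
 * lands in depends on the host byte order. (Reading a union member
 * other than the one last written is permitted in C, not in C++.) */
uint64_t halves_to_u64_union(uint32_t high, uint32_t low)
{
    union {
        uint64_t full;
        struct {
            uint32_t first;   /* lowest address */
            uint32_t second;
        } parts;
    } u;
    u.parts.first = low;
    u.parts.second = high;
    return u.full;
}

/* Hypothetical probe: returns 1 on a little-endian host, 0 otherwise. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);  /* inspect the byte at the lowest address */
    return first_byte == 1;
}
```

On a little-endian host, halves_to_u64_union(0x11223344, 0x55667788) gives 0x1122334455667788; on a big-endian host the same call gives 0x5566778811223344, which is why the shift is the portable choice.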

萌辣 2024-11-16 20:30:44

I wrote the following (hopefully valid) test:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void func(uint64_t x);

int main(int argc, char **argv)
{
#ifdef UNION
  /* Field order (low, then high) matches result.full only on little-endian targets. */
  union {
    uint64_t full;
    struct {
      uint32_t low;
      uint32_t high;
    } p;
  } result;
  #define value result.full
#else
  uint64_t result;
  #define value result
#endif
  uint32_t high, low;

  if (argc < 3) return 0;

  high = atoi(argv[1]);
  low = atoi(argv[2]);

#ifdef UNION
  result.p.high = high;
  result.p.low = low;
#else
  result = ((uint64_t) high << 32) | low;
#endif

  // printf("%08x%08x\n", (uint32_t) (value >> 32), (uint32_t) (value & 0xffffffff));
  func(value);

  return 0;
}

Running a diff of the unoptimized output of gcc -S (lines marked < are the shift version, lines marked > are the union version):

<   mov -4(%rbp), %eax
<   movq    %rax, %rdx
<   salq    $32, %rdx
<   mov -8(%rbp), %eax
<   orq %rdx, %rax
<   movq    %rax, -16(%rbp)
---
>   movl    -4(%rbp), %eax
>   movl    %eax, -12(%rbp)
>   movl    -8(%rbp), %eax
>   movl    %eax, -16(%rbp)

I don't know assembly, so it's hard for me to analyze that. However, it looks like some shifting is taking place as expected on the non-union (top) version.

But with -O2 optimization enabled, the output was identical. So the same code was generated, and both approaches will have the same performance.

(gcc version 4.5.2 on Linux/AMD64)

Partial output of optimized -O2 code with or without union:

    movq    8(%rsi), %rdi
    movl    $10, %edx
    xorl    %esi, %esi
    call    strtol

    movq    16(%rbx), %rdi
    movq    %rax, %rbp
    movl    $10, %edx
    xorl    %esi, %esi
    call    strtol

    movq    %rbp, %rdi
    mov     %eax, %eax
    salq    $32, %rdi
    orq     %rax, %rdi
    call    func

The snippet begins immediately after the jump generated by the if line.
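
To repeat this comparison without the argument parsing and the call to func getting in the way, one option is to put each variant in its own function and inspect the generated assembly for each. The sketch below assumes that setup; the function names pack_shift and pack_union and the file name pack.c are hypothetical, and the union uses the same little-endian field order as the test above.

```c
#include <stdint.h>

/* Shift version, as in the question. */
uint64_t pack_shift(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Union version, with the same field order as the test program
 * (low first). It matches pack_shift only on little-endian targets. */
uint64_t pack_union(uint32_t high, uint32_t low)
{
    union {
        uint64_t full;
        struct {
            uint32_t low;
            uint32_t high;
        } p;
    } result;
    result.p.high = high;
    result.p.low = low;
    return result.full;
}
```

Compiling with gcc -O2 -S pack.c and comparing the two function bodies in pack.s shows what a given compiler and target actually do with each form; the outcome may differ between compiler versions and architectures.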
