C++ 中循环移位（旋转）操作的最佳实践

发布于 2024-07-17 20:43:23 字数 366 浏览 11 评论 0原文

C++ 中已提供左移和右移运算符（<< 和 >>）。但是，我不知道如何执行循环移位或旋转操作。

如何进行“向左旋转”、“向右旋转”等操作？

在这里向右旋转两次

Initial --> 1000 0011 0100 0010

应该会导致：

Final   --> 1010 0000 1101 0000

一个例子会很有帮助。

（编者注：如果旋转计数为零，或者编译为不仅仅是单个旋转机器指令，则在 C 中表达旋转的许多常见方法都会出现未定义的行为。这个问题的答案应该记录最佳实践。）

原文

Left and right shift operators (<< and >>) are already available in C++.
However, I couldn't find out how I could perform circular shift or rotate operations.

How can operations like "Rotate Left" and "Rotate Right" be performed?

Rotating right twice here

Initial --> 1000 0011 0100 0010

should result in:

Final   --> 1010 0000 1101 0000

An example would be helpful.

(editor's note: Many common ways of expressing rotates in C suffer from undefined behaviour if the rotate count is zero, or compile to more than just a single rotate machine instruction. This question's answer should document best practices.)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

笑看君怀她人 2024-07-24 20:43:23

另请参阅这个关于另一个轮换问题的答案的早期版本，其中包含有关 asm gcc/clang 为 x86 生成的内容的更多详细信息。

在 C 和 C++ 中表达轮换以避免任何未定义行为的对编译器最友好的方式似乎是 John Regehr 的实施。我已将其调整为按类型的宽度进行旋转（使用像 uint32_t 这样的固定宽度类型）。

#include <stdint.h>   // for uint32_t
#include <limits.h>   // for CHAR_BIT
// #define NDEBUG
#include <assert.h>

static inline uint32_t rotl32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);  // assumes width is a power of 2.

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n<<c) | (n>>( (-c)&mask ));
}

static inline uint32_t rotr32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n>>c) | (n<<( (-c)&mask ));
}

适用于任何无符号整数类型，而不仅仅是 uint32_t，因此您可以制作其他大小的版本。

请参阅还有一个 C++11 模板版本，其中包含大量安全检查（包括static_assert 类型宽度是 2) 的幂，例如，在某些 24 位 DSP 或 36 位大型机上并非如此。

我建议仅使用模板作为名称明确包含旋转宽度的包装器的后端。 整数提升规则意味着 rotl_template(u16 & 0x11UL, 7) 将执行 32 或 64 位旋转，而不是 16（取决于 无符号长）。甚至 uint16_t & uint16_t 根据 C++ 的整数提升规则提升为 signed int，除非在 int 不宽于 uint16_t 的平台上。

On x86, this version 内联到单个 rol r32, cl （或 rol r32, imm8），编译器能够理解它，因为编译器知道 x86 旋转和移位指令屏蔽了移位计数与 C 源代码的做法相同。

编译器在 x86 上支持这种避免 UB 的习惯用法，对于用于变量计数移位的 uint32_t x 和 unsigned int n：

clang：从 clang3.5 开始识别变量计数旋转，在此之前多次轮班+或insn。
gcc：自 gcc4.9 起可识别可变计数循环，多个之前的转变+或insns。 gcc5 及更高版本也优化了维基百科版本中的分支和掩码，仅使用 ror 或 rol 指令进行变量计数。
icc：支持可变计数轮换自 ICC13 或更早以来。常量计数循环使用 shld edi,edi,7 ，在某些 CPU（尤其是 AMD，但也有一些 Intel）上，它比 rol edi,7 更慢并且占用更多字节，当 BMI2 无法用于 rorx eax,edi,25 保存 MOV 时。
MSVC：x86-64 CL19：仅识别恒定计数循环。（维基百科习语已被识别，但分支和 AND 并未被优化掉）。在 x86（包括 x86-64）上使用中的 _rotl / _rotr 内在函数。

ARM 的 gcc 使用和 r1, r1, #31 进行可变计数循环，但仍然使用单个指令进行实际循环：ror r0, r0 ，r1。所以 gcc 没有意识到旋转计数本质上是模块化的。正如 ARM 文档所说， "移位长度 n 大于 32 的 ROR 与移位长度 n-32" 的 ROR 相同。我认为 gcc 在这里会感到困惑，因为 ARM 上的左/右移位使计数饱和，因此移位 32 或更多将清除寄存器。（与 x86 不同，x86 中的移位屏蔽计数与旋转相同）。由于非循环移位在该目标上的工作方式，它可能决定在识别旋转习惯用法之前需要 AND 指令。

当前的 x86 编译器仍然使用额外的指令来屏蔽 8 位和 16 位循环的变量计数，可能与它们不避免 ARM 上的 AND 的原因相同。这是一个错过的优化，因为性能不依赖于任何 x86-64 CPU 上的循环计数。（出于性能原因，在 286 中引入了计数掩码，因为它迭代地处理移位，而不是像现代 CPU 那样具有恒定延迟。）

顺便说一句，对于可变计数旋转，更喜欢向右旋转，以避免使编译器执行 32- n 在仅提供右旋转的 ARM 和 MIPS 等架构上实现左旋转。（这可以优化编译时间常数计数。）

有趣的事实：ARM 并没有真正专用的移位/旋转指令，它只是带有源操作数在 ROR 模式下通过桶形移位器：mov r0, r0, ror r1。因此，循环可以折叠到 EOR 指令或其他指令的寄存器源操作数中。

确保 n 和返回值使用无符号类型，否则它不会是旋转。（x86 目标的 gcc 进行算术右移，移入符号位的副本而不是零，当您将两个移位的值进行“或”运算时，会导致出现问题。负符号整数的右移是C 中实现定义的行为。）

此外，确保移位计数是无符号类型，因为带符号类型的 (-n)&31 可能是补码或符号/数值，与使用无符号或二进制补码得到的模 2^n 不同。（请参阅雷格尔博客文章的评论）。 unsigned int 在我见过的每个编译器上，对于 x 的每个宽度都表现良好。某些其他类型实际上无法识别某些编译器的惯用语，因此不要仅使用与 x 相同的类型。

某些编译器提供旋转的内在函数，如果可移植版本不能在您的目标编译器上生成良好的代码，则这比内联汇编要好得多。据我所知，任何编译器都没有跨平台的内在函数。以下是一些 x86 选项：

英特尔文档提供 _rotl 和_rotl64 内在函数，右移相同。 MSVC 需要，而 gcc 需要。 #ifdef 负责处理 gcc 与 icc。 Clang 9.0 也有它，但在此之前它似乎没有在任何地方提供它们，除非在带有 -fms-extensions -fms-compatibility -fms-compatibility-version=17.00 的 MSVC 兼容模式下。它为它们发出的 asm 很糟糕（额外的掩蔽和 CMOV）。
MSVC：_rotr8 和_rotr16。
gcc 和 icc（不是 clang）：还提供 __rolb/__rorb 用于 8 位左/右旋转，<代码>__rolw/__rorw（16位），__rold/__rord（32位），__rolq< /code>/__rorq （64 位，仅为 64 位目标定义）。对于窄循环，实现使用 __builtin_ia32_rolhi 或 ...qi，但 32 和 64 位循环是使用 shift/or 定义的（没有针对 UB 的保护，因为ia32intrin 中的代码。 h 只需要在 x86 的 gcc 上工作）。 GNU C 似乎没有像 __builtin_popcount 那样具有任何跨平台的 __builtin_rotate 函数（即使它不是单个指令，它也会扩展到目标平台上的最佳值））。大多数时候，您可以通过习语识别获得好的代码。
Clang 有 __builtin_rotateleft8、__builtin_rotateright8，其他宽度也类似。请参阅 Clang 的语言扩展文档。还有其他位操作的内置函数/内在函数，例如 __builtin_bitreverse32 （可以在 ARM / AArch64 上编译为 rbit）。您可以使用 __has_builtin 来测试它们。

// For real use, probably use a rotate intrinsic for MSVC, or this idiom for other compilers.  This pattern of #ifdefs may be helpful
#if defined(__x86_64__) || defined(__i386__)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>  // Not just <immintrin.h> for compilers other than icc
#endif

uint32_t rotl32_x86_intrinsic(rotwidth_t x, unsigned n) {
  //return __builtin_ia32_rorhi(x, 7);  // 16-bit rotate, GNU C
  return _rotl(x, n);  // gcc, icc, msvc.  Intel-defined.
  //return __rold(x, n);  // gcc, icc.
  // can't find anything for clang
}
#endif

想必某些非 x86 编译器也具有内在函数，但我们不要扩展此社区 wiki 答案以包含所有这些编译器。（也许可以在有关内在函数的现有答案中做到这一点）。

（此答案的旧版本建议使用 MSVC 特定的内联汇编（仅适用于 32 位 x86 代码），或 http://www.devx.com/tips/Tip/14043 用于 C 版本。评论正在对此进行回复。）

内联汇编击败了许多优化，特别是MSVC风格，因为它强制存储/重新加载输入。精心编写的 GNU C 内联汇编循环将允许计数成为编译时常量移位计数的立即操作数，但如果要移位的值也是编译时常量，它仍然无法完全优化掉内联后。 https://gcc.gnu.org/wiki/DontUseInlineAsm 。

See also an earlier version of this answer on another rotate question with some more details about what asm gcc/clang produce for x86.

The most compiler-friendly way to express a rotate in C and C++ that avoids any Undefined Behaviour seems to be John Regehr's implementation. I've adapted it to rotate by the width of the type (using fixed-width types like uint32_t).

#include <stdint.h>   // for uint32_t
#include <limits.h>   // for CHAR_BIT
// #define NDEBUG
#include <assert.h>

static inline uint32_t rotl32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);  // assumes width is a power of 2.

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n<<c) | (n>>( (-c)&mask ));
}

static inline uint32_t rotr32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n>>c) | (n<<( (-c)&mask ));
}

Works for any unsigned integer type, not just uint32_t, so you could make versions for other sizes.

See also a C++11 template version with lots of safety checks (including a static_assert that the type width is a power of 2), which isn't the case on some 24-bit DSPs or 36-bit mainframes, for example.

I'd recommend only using the template as a back-end for wrappers with names that include the rotate width explicitly. Integer-promotion rules mean that rotl_template(u16 & 0x11UL, 7) would do a 32 or 64-bit rotate, not 16 (depending on the width of unsigned long). Even uint16_t & uint16_t is promoted to signed int by C++'s integer-promotion rules, except on platforms where int is no wider than uint16_t.

On x86, this version inlines to a single rol r32, cl (or rol r32, imm8) with compilers that grok it, because the compiler knows that x86 rotate and shift instructions mask the shift-count the same way the C source does.

Compiler support for this UB-avoiding idiom on x86, for uint32_t x and unsigned int n for variable-count shifts:

clang: recognized for variable-count rotates since clang3.5, multiple shifts+or insns before that.
gcc: recognized for variable-count rotates since gcc4.9, multiple shifts+or insns before that. gcc5 and later optimize away the branch and mask in the wikipedia version, too, using just a ror or rol instruction for variable counts.
icc: supported for variable-count rotates since ICC13 or earlier. Constant-count rotates use shld edi,edi,7 which is slower and takes more bytes than rol edi,7 on some CPUs (especially AMD, but also some Intel), when BMI2 isn't available for rorx eax,edi,25 to save a MOV.
MSVC: x86-64 CL19: Only recognized for constant-count rotates. (The wikipedia idiom is recognized, but the branch and AND aren't optimized away). Use the _rotl / _rotr intrinsics from <intrin.h> on x86 (including x86-64).

gcc for ARM uses an and r1, r1, #31 for variable-count rotates, but still does the actual rotate with a single instruction: ror r0, r0, r1. So gcc doesn't realize that rotate-counts are inherently modular. As the ARM docs say, "ROR with shift length, n, more than 32 is the same as ROR with shift length n-32". I think gcc gets confused here because left/right shifts on ARM saturate the count, so a shift by 32 or more will clear the register. (Unlike x86, where shifts mask the count the same as rotates). It probably decides it needs an AND instruction before recognizing the rotate idiom, because of how non-circular shifts work on that target.

Current x86 compilers still use an extra instruction to mask a variable count for 8 and 16-bit rotates, probably for the same reason they don't avoid the AND on ARM. This is a missed optimization, because performance doesn't depend on the rotate count on any x86-64 CPU. (Masking of counts was introduced with 286 for performance reasons because it handled shifts iteratively, not with constant-latency like modern CPUs.)

BTW, prefer rotate-right for variable-count rotates, to avoid making the compiler do 32-n to implement a left rotate on architectures like ARM and MIPS that only provide a rotate-right. (This optimizes away with compile-time-constant counts.)

Fun fact: ARM doesn't really have dedicated shift/rotate instructions, it's just MOV with the source operand going through the barrel-shifter in ROR mode: mov r0, r0, ror r1. So a rotate can fold into a register-source operand for an EOR instruction or something.

Make sure you use unsigned types for n and the return value, or else it won't be a rotate. (gcc for x86 targets does arithmetic right shifts, shifting in copies of the sign-bit rather than zeroes, leading to a problem when you OR the two shifted values together. Right-shifts of negative signed integers is implementation-defined behaviour in C.)

Also, make sure the shift count is an unsigned type, because (-n)&31 with a signed type could be one's complement or sign/magnitude, and not the same as the modular 2^n you get with unsigned or two's complement. (See comments on Regehr's blog post). unsigned int does well on every compiler I've looked at, for every width of x. Some other types actually defeat the idiom-recognition for some compilers, so don't just use the same type as x.

Some compilers provide intrinsics for rotates, which is far better than inline-asm if the portable version doesn't generate good code on the compiler you're targeting. There aren't cross-platform intrinsics for any compilers that I know of. These are some of the x86 options:

Intel documents that <immintrin.h> provides _rotl and _rotl64 intrinsics, and same for right shift. MSVC requires <intrin.h>, while gcc require <x86intrin.h>. An #ifdef takes care of gcc vs. icc. Clang 9.0 also has it, but before that it doesn't seem to provide them anywhere, except in MSVC compatibility mode with -fms-extensions -fms-compatibility -fms-compatibility-version=17.00. And the asm it emits for them sucks (extra masking and a CMOV).
MSVC: _rotr8 and _rotr16.
gcc and icc (not clang): <x86intrin.h> also provides __rolb/__rorb for 8-bit rotate left/right, __rolw/__rorw (16-bit), __rold/__rord (32-bit), __rolq/__rorq (64-bit, only defined for 64-bit targets). For narrow rotates, the implementation uses __builtin_ia32_rolhi or ...qi, but the 32 and 64-bit rotates are defined using shift/or (with no protection against UB, because the code in ia32intrin.h only has to work on gcc for x86). GNU C appears not to have any cross-platform __builtin_rotate functions the way it does for __builtin_popcount (which expands to whatever's optimal on the target platform, even if it's not a single instruction). Most of the time you get good code from idiom-recognition.
Clang has __builtin_rotateleft8, __builtin_rotateright8, and similar for other widths. See Clang's language extensions documentation. There are also builtins / intrinsics for other bit-manipulation things like __builtin_bitreverse32 (which can compile to rbit on ARM / AArch64). You can test for them with __has_builtin.

// For real use, probably use a rotate intrinsic for MSVC, or this idiom for other compilers.  This pattern of #ifdefs may be helpful
#if defined(__x86_64__) || defined(__i386__)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>  // Not just <immintrin.h> for compilers other than icc
#endif

uint32_t rotl32_x86_intrinsic(rotwidth_t x, unsigned n) {
  //return __builtin_ia32_rorhi(x, 7);  // 16-bit rotate, GNU C
  return _rotl(x, n);  // gcc, icc, msvc.  Intel-defined.
  //return __rold(x, n);  // gcc, icc.
  // can't find anything for clang
}
#endif

Presumably some non-x86 compilers have intrinsics, too, but let's not expand this community-wiki answer to include them all. (Maybe do that in the existing answer about intrinsics).

(The old version of this answer suggested MSVC-specific inline asm (which only works for 32bit x86 code), or http://www.devx.com/tips/Tip/14043 for a C version. The comments are replying to that.)

Inline asm defeats many optimizations, especially MSVC-style because it forces inputs to be stored/reloaded. A carefully-written GNU C inline-asm rotate would allow the count to be an immediate operand for compile-time-constant shift counts, but it still couldn't optimize away entirely if the value to be shifted is also a compile-time constant after inlining. https://gcc.gnu.org/wiki/DontUseInlineAsm.

C++ 中循环移位（旋转）操作的最佳实践

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（15）

关于作者

相关话题

热门标签

推荐作者

小瓶盖

wxsp_Ukbq8xGR

1638627670

仅一夜美梦

夜访吸血鬼

近卫軍团

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。