C++ 中循环移位(旋转)操作的最佳实践
C++ 中已提供左移和右移运算符(<< 和 >>)。 但是,我不知道如何执行循环移位或旋转操作。
如何进行“向左旋转”、“向右旋转”等操作?
在这里向右旋转两次
Initial --> 1000 0011 0100 0010
应该会导致:
Final --> 1010 0000 1101 0000
一个例子会很有帮助。
(编者注:如果旋转计数为零,或者编译为不仅仅是单个旋转机器指令,则在 C 中表达旋转的许多常见方法都会出现未定义的行为。这个问题的答案应该记录最佳实践。)
Left and right shift operators (<< and >>) are already available in C++.
However, I couldn't find out how I could perform circular shift or rotate operations.
How can operations like "Rotate Left" and "Rotate Right" be performed?
Rotating right twice here
Initial --> 1000 0011 0100 0010
should result in:
Final --> 1010 0000 1101 0000
An example would be helpful.
(editor's note: Many common ways of expressing rotates in C suffer from undefined behaviour if the rotate count is zero, or compile to more than just a single rotate machine instruction. This question's answer should document best practices.)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(15)
另请参阅这个关于另一个轮换问题的答案的早期版本,其中包含有关 asm gcc/clang 为 x86 生成的内容的更多详细信息。
在 C 和 C++ 中表达轮换以避免任何未定义行为的对编译器最友好的方式似乎是 John Regehr 的实施。 我已将其调整为按类型的宽度进行旋转(使用像
uint32_t
这样的固定宽度类型)。适用于任何无符号整数类型,而不仅仅是
uint32_t
,因此您可以制作其他大小的版本。请参阅还有一个 C++11 模板版本,其中包含大量安全检查(包括
static_assert
类型宽度是 2) 的幂,例如,在某些 24 位 DSP 或 36 位大型机上并非如此。我建议仅使用模板作为名称明确包含旋转宽度的包装器的后端。 整数提升规则意味着
rotl_template(u16 & 0x11UL, 7)
将执行 32 或 64 位旋转,而不是 16(取决于无符号长
)。 甚至 uint16_t & uint16_t 根据 C++ 的整数提升规则提升为signed int
,除非在int
不宽于uint16_t
的平台上。On x86, this version 内联到单个
rol r32, cl
(或rol r32, imm8
),编译器能够理解它,因为编译器知道 x86 旋转和移位指令 屏蔽了移位计数与 C 源代码的做法相同。编译器在 x86 上支持这种避免 UB 的习惯用法,对于用于变量计数移位的
uint32_t x
和unsigned int n
:ror
或rol
指令进行变量计数。shld edi,edi,7
,在某些 CPU(尤其是 AMD,但也有一些 Intel)上,它比rol edi,7
更慢并且占用更多字节,当 BMI2 无法用于rorx eax,edi,25
保存 MOV 时。
中的_rotl
/_rotr
内在函数。ARM 的 gcc 使用
和 r1, r1, #31
进行可变计数循环,但仍然使用单个指令进行实际循环:ror r0, r0 ,r1。 所以 gcc 没有意识到旋转计数本质上是模块化的。 正如 ARM 文档所说, "移位长度
n
大于 32 的 ROR 与移位长度n-32
" 的 ROR 相同。 我认为 gcc 在这里会感到困惑,因为 ARM 上的左/右移位使计数饱和,因此移位 32 或更多将清除寄存器。 (与 x86 不同,x86 中的移位屏蔽计数与旋转相同)。 由于非循环移位在该目标上的工作方式,它可能决定在识别旋转习惯用法之前需要 AND 指令。当前的 x86 编译器仍然使用额外的指令来屏蔽 8 位和 16 位循环的变量计数,可能与它们不避免 ARM 上的 AND 的原因相同。 这是一个错过的优化,因为性能不依赖于任何 x86-64 CPU 上的循环计数。 (出于性能原因,在 286 中引入了计数掩码,因为它迭代地处理移位,而不是像现代 CPU 那样具有恒定延迟。)
顺便说一句,对于可变计数旋转,更喜欢向右旋转,以避免使编译器执行
32- n
在仅提供右旋转的 ARM 和 MIPS 等架构上实现左旋转。 (这可以优化编译时间常数计数。)有趣的事实:ARM 并没有真正专用的移位/旋转指令,它只是带有 源操作数在 ROR 模式下通过桶形移位器:
mov r0, r0, ror r1
。 因此,循环可以折叠到 EOR 指令或其他指令的寄存器源操作数中。确保
n
和返回值使用无符号类型,否则它不会是旋转。 (x86 目标的 gcc 进行算术右移,移入符号位的副本而不是零,当您将两个移位的值进行“或”运算时,会导致出现问题。负符号整数的右移是C 中实现定义的行为。)此外,确保移位计数是无符号类型,因为带符号类型的
(-n)&31
可能是补码或符号/数值,与使用无符号或二进制补码得到的模 2^n 不同。 (请参阅雷格尔博客文章的评论)。unsigned int
在我见过的每个编译器上,对于x
的每个宽度都表现良好。 某些其他类型实际上无法识别某些编译器的惯用语,因此不要仅使用与x
相同的类型。某些编译器提供旋转的内在函数,如果可移植版本不能在您的目标编译器上生成良好的代码,则这比内联汇编要好得多。 据我所知,任何编译器都没有跨平台的内在函数。 以下是一些 x86 选项:
提供_rotl
和_rotl64
内在函数,右移相同。 MSVC 需要
,而 gcc 需要
。#ifdef
负责处理 gcc 与 icc。 Clang 9.0 也有它,但在此之前它似乎没有在任何地方提供它们,除非在带有-fms-extensions -fms-compatibility -fms-compatibility-version=17.00
的 MSVC 兼容模式下。 它为它们发出的 asm 很糟糕(额外的掩蔽和 CMOV)。_rotr8
和_rotr16
。
还提供__rolb
/__rorb
用于 8 位左/右旋转,<代码>__rolw/__rorw
(16位),__rold
/__rord
(32位),__rolq< /code>/
__rorq
(64 位,仅为 64 位目标定义)。 对于窄循环,实现使用__builtin_ia32_rolhi
或...qi
,但 32 和 64 位循环是使用 shift/or 定义的(没有针对 UB 的保护,因为ia32intrin 中的代码。 h
只需要在 x86 的 gcc 上工作)。 GNU C 似乎没有像__builtin_popcount
那样具有任何跨平台的__builtin_rotate
函数(即使它不是单个指令,它也会扩展到目标平台上的最佳值) )。 大多数时候,您可以通过习语识别获得好的代码。__builtin_rotateleft8
、__builtin_rotateright8
,其他宽度也类似。 请参阅 Clang 的语言扩展文档。 还有其他位操作的内置函数/内在函数,例如 __builtin_bitreverse32 (可以在 ARM / AArch64 上编译为rbit
)。 您可以使用 __has_builtin 来测试它们。想必某些非 x86 编译器也具有内在函数,但我们不要扩展此社区 wiki 答案以包含所有这些编译器。 (也许可以在有关内在函数的现有答案中做到这一点)。
(此答案的旧版本建议使用 MSVC 特定的内联汇编(仅适用于 32 位 x86 代码),或 http://www.devx.com/tips/Tip/14043 用于 C 版本。评论正在对此进行回复。)
内联汇编击败了许多优化,特别是MSVC风格,因为它强制存储/重新加载输入。 精心编写的 GNU C 内联汇编循环将允许计数成为编译时常量移位计数的立即操作数,但如果要移位的值也是编译时常量,它仍然无法完全优化掉内联后。 https://gcc.gnu.org/wiki/DontUseInlineAsm 。
See also an earlier version of this answer on another rotate question with some more details about what asm gcc/clang produce for x86.
The most compiler-friendly way to express a rotate in C and C++ that avoids any Undefined Behaviour seems to be John Regehr's implementation. I've adapted it to rotate by the width of the type (using fixed-width types like
uint32_t
).Works for any unsigned integer type, not just
uint32_t
, so you could make versions for other sizes.See also a C++11 template version with lots of safety checks (including a
static_assert
that the type width is a power of 2), which isn't the case on some 24-bit DSPs or 36-bit mainframes, for example.I'd recommend only using the template as a back-end for wrappers with names that include the rotate width explicitly. Integer-promotion rules mean that
rotl_template(u16 & 0x11UL, 7)
would do a 32 or 64-bit rotate, not 16 (depending on the width ofunsigned long
). Evenuint16_t & uint16_t
is promoted tosigned int
by C++'s integer-promotion rules, except on platforms whereint
is no wider thanuint16_t
.On x86, this version inlines to a single
rol r32, cl
(orrol r32, imm8
) with compilers that grok it, because the compiler knows that x86 rotate and shift instructions mask the shift-count the same way the C source does.Compiler support for this UB-avoiding idiom on x86, for
uint32_t x
andunsigned int n
for variable-count shifts:ror
orrol
instruction for variable counts.shld edi,edi,7
which is slower and takes more bytes thanrol edi,7
on some CPUs (especially AMD, but also some Intel), when BMI2 isn't available forrorx eax,edi,25
to save a MOV._rotl
/_rotr
intrinsics from<intrin.h>
on x86 (including x86-64).gcc for ARM uses an
and r1, r1, #31
for variable-count rotates, but still does the actual rotate with a single instruction:ror r0, r0, r1
. So gcc doesn't realize that rotate-counts are inherently modular. As the ARM docs say, "ROR with shift length,n
, more than 32 is the same as ROR with shift lengthn-32
". I think gcc gets confused here because left/right shifts on ARM saturate the count, so a shift by 32 or more will clear the register. (Unlike x86, where shifts mask the count the same as rotates). It probably decides it needs an AND instruction before recognizing the rotate idiom, because of how non-circular shifts work on that target.Current x86 compilers still use an extra instruction to mask a variable count for 8 and 16-bit rotates, probably for the same reason they don't avoid the AND on ARM. This is a missed optimization, because performance doesn't depend on the rotate count on any x86-64 CPU. (Masking of counts was introduced with 286 for performance reasons because it handled shifts iteratively, not with constant-latency like modern CPUs.)
BTW, prefer rotate-right for variable-count rotates, to avoid making the compiler do
32-n
to implement a left rotate on architectures like ARM and MIPS that only provide a rotate-right. (This optimizes away with compile-time-constant counts.)Fun fact: ARM doesn't really have dedicated shift/rotate instructions, it's just MOV with the source operand going through the barrel-shifter in ROR mode:
mov r0, r0, ror r1
. So a rotate can fold into a register-source operand for an EOR instruction or something.Make sure you use unsigned types for
n
and the return value, or else it won't be a rotate. (gcc for x86 targets does arithmetic right shifts, shifting in copies of the sign-bit rather than zeroes, leading to a problem when youOR
the two shifted values together. Right-shifts of negative signed integers is implementation-defined behaviour in C.)Also, make sure the shift count is an unsigned type, because
(-n)&31
with a signed type could be one's complement or sign/magnitude, and not the same as the modular 2^n you get with unsigned or two's complement. (See comments on Regehr's blog post).unsigned int
does well on every compiler I've looked at, for every width ofx
. Some other types actually defeat the idiom-recognition for some compilers, so don't just use the same type asx
.Some compilers provide intrinsics for rotates, which is far better than inline-asm if the portable version doesn't generate good code on the compiler you're targeting. There aren't cross-platform intrinsics for any compilers that I know of. These are some of the x86 options:
<immintrin.h>
provides_rotl
and_rotl64
intrinsics, and same for right shift. MSVC requires<intrin.h>
, while gcc require<x86intrin.h>
. An#ifdef
takes care of gcc vs. icc. Clang 9.0 also has it, but before that it doesn't seem to provide them anywhere, except in MSVC compatibility mode with-fms-extensions -fms-compatibility -fms-compatibility-version=17.00
. And the asm it emits for them sucks (extra masking and a CMOV)._rotr8
and_rotr16
.<x86intrin.h>
also provides__rolb
/__rorb
for 8-bit rotate left/right,__rolw
/__rorw
(16-bit),__rold
/__rord
(32-bit),__rolq
/__rorq
(64-bit, only defined for 64-bit targets). For narrow rotates, the implementation uses__builtin_ia32_rolhi
or...qi
, but the 32 and 64-bit rotates are defined using shift/or (with no protection against UB, because the code inia32intrin.h
only has to work on gcc for x86). GNU C appears not to have any cross-platform__builtin_rotate
functions the way it does for__builtin_popcount
(which expands to whatever's optimal on the target platform, even if it's not a single instruction). Most of the time you get good code from idiom-recognition.__builtin_rotateleft8
,__builtin_rotateright8
, and similar for other widths. See Clang's language extensions documentation. There are also builtins / intrinsics for other bit-manipulation things like__builtin_bitreverse32
(which can compile torbit
on ARM / AArch64). You can test for them with__has_builtin
.Presumably some non-x86 compilers have intrinsics, too, but let's not expand this community-wiki answer to include them all. (Maybe do that in the existing answer about intrinsics).
(The old version of this answer suggested MSVC-specific inline asm (which only works for 32bit x86 code), or http://www.devx.com/tips/Tip/14043 for a C version. The comments are replying to that.)
Inline asm defeats many optimizations, especially MSVC-style because it forces inputs to be stored/reloaded. A carefully-written GNU C inline-asm rotate would allow the count to be an immediate operand for compile-time-constant shift counts, but it still couldn't optimize away entirely if the value to be shifted is also a compile-time constant after inlining. https://gcc.gnu.org/wiki/DontUseInlineAsm.
C++20
std::rotl
和std::rotr
已经到来! http://www.open-std.org /jtc1/sc22/wg21/docs/papers/2019/p0553r4.html 并应将其添加到
标头中。cppreference 说 用法如下:
给出输出:
我会给它尝试一下,当 GCC 支持到达时,带有
g++-9 -std=c++2a
的 GCC 9.1.0 仍然不支持它。该提案称:
和:
还添加了
std::popcount
来计算 1 位的数量:如何计算 32 位整数中设置的位数?C++20
std::rotl
andstd::rotr
It has arrived! http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and should add it to the
<bit>
header.cppreference says that the usage will be like:
giving output:
I'll give it a try when support arrives to GCC, GCC 9.1.0 with
g++-9 -std=c++2a
still doesn't support it.The proposal says:
and:
A
std::popcount
was also added to count the number of 1 bits: How to count the number of set bits in a 32-bit integer?由于它是 C++,因此使用内联函数:
C++11 变体:
Since it's C++, use an inline function:
C++11 variant:
大多数编译器都有相应的内在函数。
Visual Studio 例如 _rotr8、_rotr16
Most compilers have intrinsics for that.
Visual Studio for example _rotr8, _rotr16
明确地说:
Definitively:
如果 x 是 8 位值,则可以使用:
If x is an 8 bit value, you can use this:
使用标准位集怎么可能是这样的……
HTH,
How abt something like this, using the standard bitset ...
HTH,
详细来说,您可以应用以下逻辑。
如果位模式为整数 33602
,并且您需要使用 2 个右移进行翻转,则:
首先复制位模式,然后左移它:长度 - 右移
即长度为16右移值为2
16 - 2 = 14
左移 14 次后得到。
现在根据需要右移值 33602 2 次。
现在,您可以
在 14 次左移值和 2 次右移值之间进行“或”运算。
然后您就得到了转移的展期价值。 请记住,按位运算速度更快,甚至不需要任何循环。
In details you can apply the following logic.
If Bit Pattern is 33602 in Integer
and you need to Roll over with 2 right shifs then:
first make a copy of bit pattern and then left shift it: Length - RightShift
i.e. length is 16 right shift value is 2
16 - 2 = 14
After 14 times left shifting you get.
Now right shift the value 33602, 2 times as required.
You get
Now take an OR between 14 time left shifted value and 2 times right shifted value.
And you get your shifted rollover value. Remember bit wise operations are faster and this don't even required any loop.
假设您要右移
L
位,并且输入x
是一个N
位的数字:Assuming you want to shift right by
L
bits, and the inputx
is a number withN
bits:正确答案如下:
The correct answer is following:
源代码
x 位数
Source Code
x bit number
另一个建议
another suggestion
下面是 Dídac Pérez 的答案的稍微改进版本,有两个方向实现,以及使用 unsigned char 和 unsigned long long 值演示这些函数的用法。 几个注意事项:
cout << 我在这里找到了以数字方式简洁输出无符号字符的 +value
技巧: https://stackoverflow.com/a/28414758 /1599699
语法。这是我正在使用的代码:
Below is a slightly improved version of Dídac Pérez's answer, with both directions implemented, along with a demo of these functions' usages using unsigned char and unsigned long long values. Several notes:
cout << +value
trick for tersely outputting an unsigned char numerically that I found here: https://stackoverflow.com/a/28414758/1599699<put the type here>
syntax for clarity and safety.Here's the code I'm using:
重载一个函数:
Overload a function: