Is there a fast way to convert a string of 8 ASCII decimal digit characters into a binary number?
Consider 8 digit characters like 12345678 as a string. It can be converted to a number where every byte contains a digit, like this:
const char* const str = "12345678";
const char* const base = "00000000";
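// note: these type-punning loads assume a little-endian target and tolerate
// strict aliasing; a portable variant would std::memcpy into a uint64_t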
const uint64_t unpacked = *reinterpret_cast<const uint64_t*>(str)
- *reinterpret_cast<const uint64_t*>(base);
Then unpacked will be 0x0807060504030201 on a little-endian system.
What is the fastest way to convert the number into 12345678, perhaps by multiplying it by some magic number, or by using SIMD up to AVX2?
UPDATE: 12345678 has to be a number stored in a 32-bit or 64-bit integer, not a string.
Comments (4)
On an older x86-64 system without AVX2, this simple version based on gathering digits in tree fashion is quite efficient, with performance on par with a simple SWAR-based implementation per my measurements. It requires a processor with plenty of instruction-level parallelism, however, as it comprises 50% more instructions than the SWAR-based code when compiled with full optimizations.
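A minimal sketch of what such a tree-fashion gather might look like (the function name and interface are my own, not this answer's exact code): adjacent digits are combined into 2-digit values, those into 4-digit values, and finally the two halves into the full 8-digit number, so the multiplies form a balanced tree the CPU can execute in parallel.

#include <cstdint>

// Hypothetical illustration of gathering digits in tree fashion.
uint32_t gather_tree(const char* p)
{
    uint32_t p01 = (p[0] - '0') * 10 + (p[1] - '0');   // "12"
    uint32_t p23 = (p[2] - '0') * 10 + (p[3] - '0');   // "34"
    uint32_t p45 = (p[4] - '0') * 10 + (p[5] - '0');   // "56"
    uint32_t p67 = (p[6] - '0') * 10 + (p[7] - '0');   // "78"
    uint32_t q03 = p01 * 100 + p23;                    // "1234"
    uint32_t q47 = p45 * 100 + p67;                    // "5678"
    return q03 * 10000 + q47;                          // 12345678
}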
If you change your input format to breadth-first element order, storing the first digits of all values together, then all the second digits, and so on (as sketched below), and if you convert more than just 9 values, say 512, or 8192 padded to any multiple of 32, the compiler should vectorize it. To prepare the input, you can use 8 different channels, one per digit of every parsed value.
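A minimal sketch of the layout and loop this answer describes, under my own naming (digits[d][i] holds digit d of value i, already converted from ASCII to 0-9); a vectorizing compiler can unroll the inner accumulation and run the outer loop across values in SIMD lanes:

#include <cstdint>
#include <cstddef>

// Hypothetical breadth-first layout: row d holds digit d (most
// significant first) of every value, as 0-9 bytes.
void convert_breadth_first(const uint8_t digits[8][512],
                           uint32_t out[512], size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t v = 0;
        for (int d = 0; d < 8; ++d)        // fully unrolled by the compiler
            v = v * 10 + digits[d][i];
        out[i] = v;
    }
}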
I've implemented a small program to test some ideas. The AVX2 implementation is ~1.5 times faster than the naive one, with the table-based implementation in between.
Source code:
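As an illustration of the kind of SIMD kernel such a benchmark would exercise, here is a widely known SSE rendition of the digit reduction (my own sketch, not the answerer's benchmark code): one multiply-add over adjacent digit bytes with weights {10, 1}, one over adjacent 16-bit pairs with weights {100, 1}, and a scalar combine of the two 4-digit halves.

#include <immintrin.h>   // SSE4.1 intrinsics; compile with -msse4.1 or -mavx2
#include <cstdint>

uint32_t parse8_simd(const char* s)
{
    __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(s));
    v = _mm_sub_epi8(v, _mm_set1_epi8('0'));           // ASCII -> digit values
    const __m128i w10  = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                      1, 10, 1, 10, 1, 10, 1, 10);
    v = _mm_maddubs_epi16(v, w10);                     // 16-bit: 12, 34, 56, 78
    const __m128i w100 = _mm_set_epi16(0, 0, 0, 0, 1, 100, 1, 100);
    v = _mm_madd_epi16(v, w100);                       // 32-bit: 1234, 5678
    uint32_t hi = static_cast<uint32_t>(_mm_cvtsi128_si32(v));     // 1234
    uint32_t lo = static_cast<uint32_t>(_mm_extract_epi32(v, 1));  // 5678
    return hi * 10000 + lo;                            // 12345678
}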
Multiplication in binary is just a series of shifts and adds. A SWAR approach
shouldn't be too hard to understand. For a detailed walk-through see:
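As a sketch of the idea (my rendition, not the linked walk-through's exact code), SWAR works on the question's unpacked value directly: each multiplier of the form k*2^n + 1 adds every digit group to k times its neighbor in a single multiply, which is exactly the shift-and-add view above, and a mask then keeps every other group.

#include <cstdint>
#include <cstring>

// Classic SWAR reduction over 8 packed digit bytes (little-endian assumed,
// matching the question). Each step halves the number of digit groups.
uint32_t parse8_swar(const char* s)
{
    uint64_t v;
    std::memcpy(&v, s, 8);
    v &= 0x0F0F0F0F0F0F0F0F;                                       // ASCII -> 0-9
    v = (v * (1 + (10ULL    << 8 ))) >> 8  & 0x00FF00FF00FF00FF;   // 12|34|56|78
    v = (v * (1 + (100ULL   << 16))) >> 16 & 0x0000FFFF0000FFFF;   // 1234|5678
    v = (v * (1 + (10000ULL << 32))) >> 32;                        // 12345678
    return static_cast<uint32_t>(v);
}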