Fastest way to convert binary to decimal?
I've got four unsigned 32-bit integers representing an unsigned 128-bit integer, in little endian order:
typedef struct {
    unsigned int part[4];
} bigint_t;
I'd like to convert this number into its decimal string representation and output it to a file.
Right now, I'm using a bigint_divmod10 function to divide the number by 10, keeping track of the remainder. I call this function repeatedly, outputting the remainder as a digit, until the number is zero. It's pretty slow. Is this the fastest way to do it? If so, is there a clever way to implement this function that I'm not seeing? I've tried looking at GMP's get_str.c, but I find it pretty impenetrable.
EDIT: here's the fastest code I was able to come up with for the divmod10 function:
static unsigned uint128_divmod10(uint128 *value)
{
    unsigned int a = value->word[3];
    unsigned int b = value->word[2];
    unsigned int c = value->word[1];
    unsigned int d = value->word[0];

    // Divide each word by 5 on its own.
    unsigned int diva = a / 5;
    unsigned int divb = b / 5;
    unsigned int divc = c / 5;
    unsigned int divd = d / 5;

    value->word[3] = diva;
    value->word[2] = divb;
    value->word[1] = divc;
    value->word[0] = divd;

    // Remainder of each word modulo 5.
    unsigned int moda = a - diva*5;
    unsigned int modb = b - divb*5;
    unsigned int modc = c - divc*5;
    unsigned int modd = d - divd*5;

    // Push the remainders down one word at a time. Because
    // 2^32 = 5*858993459 + 1, a remainder r left in one word adds
    // r*858993459 to the quotient of the next lower word and is
    // carried on as a remainder r there.
    unsigned int mod = 0;
    mod += moda;
    unsigned int carryb = mod*858993459;
    mod += modb;
    if (mod >= 5) {
        mod -= 5;
        carryb++;
    }
    unsigned int carryc = mod*858993459;
    mod += modc;
    if (mod >= 5) {
        mod -= 5;
        carryc++;
    }
    unsigned int carryd = mod*858993459;
    mod += modd;
    if (mod >= 5) {
        mod -= 5;
        carryd++;
    }

    uint128_add(value, carryd, 0);
    uint128_add(value, carryc, 1);
    uint128_add(value, carryb, 2);

    // value now holds the quotient by 5 and mod the remainder modulo 5.
    // If that quotient is odd, halving it leaves another 5 in the remainder.
    if (value->word[0] & 1) {
        mod += 5;
    }
    uint128_shift(value, -1);
    return mod;
}
where the add function is defined as:
static void uint128_add(uint128 *value, unsigned int k, unsigned int pos)
{
    unsigned int a = value->word[pos];
    value->word[pos] += k;
    if (value->word[pos] < a) {
        // overflow
        for (int i=pos+1; i<4; i++) {
            value->word[i]++;
            if (value->word[i]) {
                break;
            }
        }
    }
}
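The uint128_shift function called above is not shown. A minimal sketch of a one-bit right shift, assuming that a count of -1 means "shift right by one bit" and that word[0] is the least significant word; the name uint128_shift_right1 and the struct layout are assumptions made for illustration:

```c
#include <stdint.h>

/* Assumed layout: word[0] is the least significant 32 bits. */
typedef struct {
    uint32_t word[4];
} uint128;

/* Hypothetical helper: shift the 128-bit value right by one bit. */
static void uint128_shift_right1(uint128 *value)
{
    for (int i = 0; i < 3; i++) {
        /* The low bit of the next higher word becomes the top bit here. */
        value->word[i] = (value->word[i] >> 1) | (value->word[i + 1] << 31);
    }
    value->word[3] >>= 1;
}
```

A general shift by an arbitrary count would need extra handling for counts of 32 or more; the one-bit case is all that divmod10 requires.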
Comments (6)
It depends what else you're doing with the numbers. You can trade off a slight loss in space efficiency and a modest loss in efficiency of multiprecision arithmetic in return for very efficient conversion to and from decimal. The key is to do multiprecision arithmetic with a base that is a power of 10 rather than a power of 2.
For example, you might use base 10,000, where you pack one digit into a 16-bit word and you do your arithmetic on digits in 32-bit integers. (If you're on a 64-bit machine you can double that and do base 1,000,000,000.) This kind of code is relatively efficient timewise, although not quite as fast as using the native power of two because you can't take advantage of the carry bit on the hardware.
And you can't represent as many integers in the same number of bits.
But it's a whiz at converting to and from decimal, because you get to convert the individual digits without any long division.
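As a rough sketch of why the conversion becomes cheap, assume a number is stored as base-10,000 digits in 16-bit words, least significant digit first. The names dec4_add and dec4_to_string and this layout are assumptions for illustration, not code from the answer. Addition carries in base 10,000 using 32-bit intermediates, and printing only needs to format each digit as four decimal characters:

```c
#include <stdint.h>
#include <stdio.h>

#define BASE 10000u

/* z = x + y for n-digit base-10,000 numbers (least significant first).
   A carry out of the top digit is dropped in this sketch. */
static void dec4_add(uint16_t *z, const uint16_t *x, const uint16_t *y, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t s = (uint32_t)x[i] + y[i] + carry;
        z[i] = (uint16_t)(s % BASE);
        carry = s / BASE;
    }
}

/* Write the decimal string; out must hold 4*n + 1 bytes. No long division. */
static void dec4_to_string(const uint16_t *digit, int n, char *out)
{
    while (n > 1 && digit[n - 1] == 0)
        n--;                                   /* skip leading zero digits */
    out += sprintf(out, "%u", (unsigned)digit[n - 1]);
    for (int i = n - 2; i >= 0; i--)
        out += sprintf(out, "%04u", (unsigned)digit[i]);
}
```

The only per-digit work during output is a division of a value below 10,000, which is what makes this representation attractive when printing dominates.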
If you need to represent the full range of numbers from zero to ((1 << 128) - 1), you can still do this, but add an extra digit, so your numbers will be bigger.
If it turns out you really need the extra space/speed (maybe you're doing a lot of cryptographic 128-bit calculations) then the method of simultaneous div/mod by 10 is the fastest method I know. The only other trick is that if small integers are common, you can handle them specially. (That is, if the three most significant 32-bit words are all zero, just use the native division to convert.)
Dave Hanson's C Interfaces and Implementations has a lengthy chapter on multiprecision arithmetic. Dividing a large number by a single digit is a special case that has this efficient implementation:
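A minimal sketch of such a single-digit ("short") division, written for illustration and not taken from the book: the function name short_divide, the 16-bit digit type and the explicit base parameter are assumptions. It walks from the most significant digit down, carrying the running remainder into the next digit:

```c
#include <stdint.h>

/* z = x / y, returns x % y.
   x and z hold n digits in the given base, least significant first.
   Requires 0 < y < base and base <= 10000 so that carry*base + x[i]
   fits in 32 bits. */
static unsigned short_divide(unsigned base, int n, uint16_t *z,
                             const uint16_t *x, unsigned y)
{
    unsigned carry = 0;
    for (int i = n - 1; i >= 0; i--) {
        carry = carry * base + x[i];   /* bring down the next digit */
        z[i] = (uint16_t)(carry / y);
        carry %= y;                    /* remainder feeds the next step */
    }
    return carry;
}
```

With base 10,000 and y = 10 this yields one decimal digit per call, but in that representation each stored digit can be printed directly, so the division is rarely needed for output.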
For full understanding, it really helps to have the book, but the source code is still a lot easier to understand than the GNU source code. And you could easily adapt it to use base 10,000 (it currently uses base 256).
Summary: if your performance bottleneck is conversion to decimal, implement multiprecision arithmetic with a base that is a power of 10. If your machine's native word size is 32 and you are using C code, use 10,000 in a 16-bit word.
If your values are mostly less than ULLONG_MAX (18446744073709551615), I'd try to use sprintf(buf, "%llu", ullong_val) for them. I bet this is rather well optimized in the standard library, though parsing the format string will take some cycles.
Otherwise I'd create a bigint_divmod1000000000 (or, better named, mod10to9) function and use that. It would need 9 times fewer divides than bigint_divmod10.
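A minimal sketch of that idea, assuming the bigint_t layout from the question with 32-bit words and a 64-bit intermediate type available; bigint_to_decimal is a hypothetical driver name. Each call divides the whole number by 10^9 and returns nine decimal digits at once:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t part[4];   /* little endian: part[0] is least significant */
} bigint_t;

/* Divide *v by 10^9 in place and return the remainder. */
static uint32_t bigint_divmod1000000000(bigint_t *v)
{
    uint64_t rem = 0;
    for (int i = 3; i >= 0; i--) {
        /* rem < 10^9 < 2^30, so (rem << 32) cannot overflow 64 bits. */
        uint64_t cur = (rem << 32) | v->part[i];
        v->part[i] = (uint32_t)(cur / 1000000000u);
        rem = cur % 1000000000u;
    }
    return (uint32_t)rem;
}

/* Write the decimal string of v; out must hold at least 40 bytes. */
static void bigint_to_decimal(bigint_t v, char *out)
{
    uint32_t chunk[5];   /* 2^128 has 39 digits, so at most 5 chunks of 9 */
    int n = 0;
    do {
        chunk[n++] = bigint_divmod1000000000(&v);
    } while (v.part[0] | v.part[1] | v.part[2] | v.part[3]);

    /* Most significant chunk without padding, the rest zero-padded to 9. */
    out += sprintf(out, "%u", (unsigned)chunk[--n]);
    while (n > 0)
        out += sprintf(out, "%09u", (unsigned)chunk[--n]);
}
```

A full 128-bit value needs at most five multiword divisions this way, instead of up to 39 with a divide-by-10 loop.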
Lookup tables of 8 bits.
You can have 4 lookup tables of 256 entries each.
The first table covers the values 0-255 of the least significant byte, the second table is the first table multiplied by 256, and so on. Each entry stores the decimal digits of its value, one digit per byte.
So when you need your number, sum up the entries from the lookup tables.
When adding, you can add them as binary and later make one pass over each byte to fix the overflows.
Example
Number: 0x12345678
In the first lookup table, under address 0x78 (= 120), the entry is 0x010200; that is the first number.
In the second table, under 0x56 (= 86), the entry is 0x0202000106 (0x5600 in decimal is 22016).
In the third table, under 0x34 (= 52), you would have 0x03040007080702.
And in the last table, under 0x12, you have 0x030001090809080808 (this does not fit in 32-bit arithmetic, but you already know that).
Then sum up these numbers (as binary numbers) and make one pass, byte by byte, to fix the overflow.
The code in the for loop is something like
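A minimal sketch of the table approach for a single 32-bit value, written in C for illustration. The table layout (one decimal digit per byte, least significant digit first, rather than the packed hexadecimal form shown in the example) and the names init_tables and u32_to_decimal are assumptions:

```c
#include <stdint.h>

/* table[t][b][k] = k-th decimal digit (least significant first)
   of the value b * 256^t. Ten digits cover any 32-bit value. */
static uint8_t table[4][256][10];

static void init_tables(void)
{
    for (int t = 0; t < 4; t++) {
        for (int b = 0; b < 256; b++) {
            uint64_t v = (uint64_t)b << (8 * t);
            for (int k = 0; k < 10; k++) {
                table[t][b][k] = (uint8_t)(v % 10);
                v /= 10;
            }
        }
    }
}

/* Write the decimal string of x; out must hold at least 11 bytes. */
static void u32_to_decimal(uint32_t x, char *out)
{
    uint8_t sum[10] = {0};

    /* Add the four table entries digit by digit. Each position is at
       most 4 * 9 = 36, so a byte cannot overflow here. */
    for (int t = 0; t < 4; t++) {
        const uint8_t *row = table[t][(x >> (8 * t)) & 0xFF];
        for (int k = 0; k < 10; k++)
            sum[k] += row[k];
    }

    /* One pass to bring every position back into the range 0..9. */
    unsigned carry = 0;
    for (int k = 0; k < 10; k++) {
        unsigned d = sum[k] + carry;
        sum[k] = (uint8_t)(d % 10);
        carry = d / 10;
    }

    int k = 9;
    while (k > 0 && sum[k] == 0)
        k--;                              /* skip leading zeros */
    while (k >= 0)
        *out++ = (char)('0' + sum[k--]);
    *out = '\0';
}
```

Extending this to 128 bits would take 16 tables with 39 digit positions each; the per-position sums then reach at most 16 * 9 = 144, which still fits in a byte.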
If we count the operations needed for this:
1. Looking in the tables and adding: 4 lookup tables, 16 additions (keep in mind that you do not need to care about overflow here, because it cannot occur).
2. One pass with 3 operations in each step, 16 steps in the pass.
Pessimistic upper bound: 6*16 = 96, so roughly 100 operations.
EDIT:
Here is C++ code; it is 30% faster than the naive implementation.
For future reference, instead of implementing a uint128 type, I just used the characters of the string directly. This turned out to be much faster than going from string to uint128 and back.
The most immediate speedup will come from inlining the conversion rather than calling functions; it could be as simple as marking bigint_divmod10() inline, or using profile-guided optimisation as offered by your compiler.
I know this question is old, but I want to contribute, since no one has proposed a way to avoid the division loop. This one uses pow2. I haven't benchmarked it, but in theory it should be faster than any of the others, and the pow function could be tweaked as well.
Output:
36