Integer division algorithm
I was thinking about an algorithm in division of large numbers: dividing with remainder bigint C by bigint D, where we know the representation of C in base b, and D is of form b^k-1. It's probably the easiest to show it on an example. Let's try dividing C=21979182173 by D=999.
- We write the number as sets of three digits: 21 979 182 173
- We take sums (modulo 999) of consecutive sets, starting from the left: 21 001 183 356
- We add 1 to those sets preceding the ones where we "went over 999": 22 001 183 356
Indeed, 21979182173/999=22001183 and remainder 356.
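For concreteness, here is a minimal C++ sketch of exactly this example (b=10, k=3, so D=999). The three-digit groups are held in plain machine integers purely to illustrate the steps, not as real bigints:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // 21979182173 split into three-digit groups, most significant first.
    std::vector<uint32_t> g = {21, 979, 182, 173};
    for (std::size_t i = 1; i < g.size(); ++i) {
        uint32_t t = g[i - 1] + g[i];   // sum of consecutive groups
        if (t + 1 >= 1000) {            // "went over 999" (this also catches t == 999 exactly)
            g[i] = (t + 1) - 1000;      // equivalent to t % 999 in this range
            g[i - 1] += 1;              // add 1 to the preceding group
        } else {
            g[i] = t;
        }
    }
    // All groups but the last form the quotient; the last group is the remainder.
    std::cout << g[0] << " " << g[1] << " " << g[2] << "  rem " << g[3] << "\n";
    // Prints: 22 1 183  rem 356, i.e. quotient 22 001 183, remainder 356.
}
```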
I've calculated the complexity and, if I'm not mistaken, the algorithm should work in O(n), n being the number of digits of C in base b representation. I've also done a very crude and unoptimized version of the algorithm (only for b=10) in C++, tested it against GMP's general integer division algorithm and it really does seem to fare better than GMP. I couldn't find anything like this implemented anywhere I looked, so I had to resort to testing it against general division.
I found several articles which discuss what seem to be quite similar matters, but none of them concentrate on actual implementations, especially in bases different than 2. I suppose that's because of the way numbers are internally stored, although the mentioned algorithm seems useful for, say, b=10, even taking that into account. I also tried contacting some other people, but, again, to no avail.
Thus, my question would be: is there an article or a book or something where the aforementioned algorithm is described, possibly discussing the implementations? If not, would it make sense for me to try and implement and test such an algorithm in, say, C/C++ or is this algorithm somehow inherently bad?
Also, I'm not a programmer and while I'm reasonably OK at programming, I admittedly don't have much knowledge of computer "internals". Thus, pardon my ignorance - it's highly possible there are one or more very stupid things in this post. Sorry once again.
Thanks a lot!
Further clarification of points raised in the comments/answers:
Thanks, everyone - as I didn't want to comment on all the great answers and advice with the same thing, I'd just like to address one point a lot of you touched on.
I am fully aware that working in bases 2^n is, generally speaking, clearly the most efficient way of doing things. Pretty much all bigint libraries use 2^32 or whatever. However, what if (and, I emphasize, it would be useful only for this particular algorithm!) we implement bigints as an array of digits in base b? Of course, we require b here to be "reasonable": b=10, the most natural case, seems reasonable enough. I know it's more or less inefficient in terms of both memory and time, given how numbers are internally stored, but, if my (basic and possibly somehow flawed) tests are correct, I have been able to produce results faster than GMP's general division, which would make implementing such an algorithm worthwhile.
Ninefingers notes that in that case I'd have to use an expensive modulo operation. I hope not: I can see whether old+new crossed, say, 999 just by looking at the number of digits of old+new+1. If it has 4 digits, we're done. What's more, since old<999 and new<=999, we know that if old+new+1 has 4 digits (it can't have more), then (old+new)%999 equals old+new+1 with its leftmost digit deleted, which I presume we can do cheaply. (For example, with old=998 and new=500, old+new+1=1499 has 4 digits; deleting the leading 1 gives 499, which is indeed 1498 mod 999.)
Of course, I'm not disputing the obvious limitations of this algorithm, nor am I claiming it can't be improved - it can only divide by a certain class of numbers, and we have to know the representation of the dividend in base b a priori. However, for b=10, for instance, the latter seems natural.
Now, say we have implemented bignums as I outlined above. Say C=(a_1a_2...a_n) in base b and D=b^k-1. The algorithm (which could probably be much more optimized) would go like this - a rough C++ sketch follows the list. I hope there aren't many typos.
- if k>n, we're obviously done
- add a zero (i.e. a_0=0) at the beginning of C (just in case we try to divide, say, 9999 with 99)
- l=n%k (mod for "regular" integers - shouldn't be too expensive)
- old=(a_0...a_l) (the first set of digits, possibly with less than k digits)
- for (i=l+1; i < n; i=i+k) (We will have floor(n/k) or so iterations)
- new=(a_i...a_(i+k-1))
- new=new+old (this is bigint addition, thus O(k))
- aux=new+1 (again, bigint addition - O(k) - which I'm not happy about)
- if aux has more than k digits
- delete first digit of aux
- old=old+1 (bigint addition once again)
- fill old with zeroes at the beginning so it has as many digits as it should
- (a_(i-k)...a_(i-1))=old (if i=l+1, (a_0...a_l)=old)
- new=aux
- fill new with zeroes at the beginning so it has as many digits as it should
- (a_i...a_(i+k-1))=new
- quot=(a_0...a_(n-k))
- rem=new
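Below is a rough, unoptimized C++ sketch of the list above, for b=10 only, following it step by step. For brevity the k-digit groups are read into machine integers (fine for small k); a real implementation would keep them as digit arrays and use O(k) bigint addition, as described:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Read/write a group of `len` digits starting at `pos` (digits stored most-significant first).
static uint64_t group(const std::vector<int>& d, std::size_t pos, std::size_t len) {
    uint64_t v = 0;
    for (std::size_t j = 0; j < len; ++j) v = v * 10 + d[pos + j];
    return v;
}
static void store(std::vector<int>& d, std::size_t pos, std::size_t len, uint64_t v) {
    for (std::size_t j = len; j-- > 0; ) { d[pos + j] = int(v % 10); v /= 10; }  // zero-padded
}

// Divide the decimal number C by 10^k - 1 (e.g. k = 3 gives 999).
static void divide_by_nines(const std::string& C, std::size_t k) {
    std::vector<int> d;
    d.push_back(0);                         // the prepended a_0 = 0
    for (char c : C) d.push_back(c - '0');
    std::size_t n = C.size();               // a_1..a_n
    uint64_t pow_k = 1;
    for (std::size_t j = 0; j < k; ++j) pow_k *= 10;   // 10^k

    std::size_t l = n % k;
    uint64_t old_ = group(d, 0, l + 1);     // first group a_0..a_l
    for (std::size_t i = l + 1; i + k <= d.size(); i += k) {
        uint64_t new_ = group(d, i, k);
        uint64_t aux = old_ + new_ + 1;
        if (aux >= pow_k) {                 // aux has k+1 digits: we crossed b^k - 1
            new_ = aux - pow_k;             // "delete the first digit of aux"
            old_ += 1;                      // carry into the previous group
        } else {
            new_ = old_ + new_;
        }
        if (i == l + 1) store(d, 0, l + 1, old_);
        else            store(d, i - k, k, old_);
        store(d, i, k, new_);
        old_ = new_;
    }
    std::string all;
    for (int x : d) all += char('0' + x);
    std::cout << "quot " << all.substr(0, n + 1 - k)
              << " rem " << all.substr(n + 1 - k) << "\n";
}

int main() {
    divide_by_nines("21979182173", 3);      // prints: quot 022001183 rem 356 (note the leading zeros)
}
```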
There, thanks for discussing this with me - as I said, this does seem to me to be an interesting "special case" algorithm to try to implement, test and discuss, if nobody sees any fatal flaws in it. If it's something not widely discussed so far, even better. Please, let me know what you think. Sorry about the long post.
Also, just a few more personal comments:
@Ninefingers: I actually have some (very basic!) knowledge of how GMP works, what it does and of general bigint division algorithms, so I was able to understand much of your argument. I'm also aware GMP is highly optimized and in a way customizes itself for different platforms, so I'm certainly not trying to "beat it" in general - that seems about as fruitful as attacking a tank with a pointed stick. However, that's not the idea of this algorithm - it works in very special cases (which GMP does not appear to cover). On an unrelated note, are you sure general divisions are done in O(n)? The most I've seen is M(n). (And M(n), if I understand correctly, doesn't reach O(n) in practice (Schönhage–Strassen etc.). Fürer's algorithm, which still doesn't reach O(n), is, if I'm correct, almost purely theoretical.)
@Avi Berger: This doesn't actually seem to be exactly the same as "casting out nines", although the idea is similar. However, the aforementioned algorithm should work all the time, if I'm not mistaken.
Your algorithm is a variation of a base 10 algorithm known as "casting out nines". Your example is using base 1000 and "casting out" 999's (one less than the base). This used to be taught in elementary school as a way to do a quick check on hand calculations. I had a high school math teacher who was horrified to learn that it wasn't being taught anymore and filled us in on it.
Casting out 999's in base 1000 won't work as a general division algorithm. It will generate values that are congruent modulo 999 to the actual quotient and remainder - not the actual values. Your algorithm is a bit different and I haven't checked if it works, but it is based on effectively using base 1000 and the divisor being 1 less than the base. If you wanted to try it for dividing by 47, you would have to convert to a base 48 number system first.
Google "casting out nines" for more information.
Edit: I originally read your post a bit too quickly, and you do know of this as a working algorithm. As @Ninefingers and @Karl Bielefeldt have stated more clearly than me in their comments, what you aren't including in your performance estimate is the conversion into a base appropriate for the particular divisor at hand.
I feel the need to add to this based on my comment. This isn't an answer, but an explanation as to the background.
A bignum library uses what are called limbs - search for mp_limb_t in the gmp source - which are usually fixed-size integer fields.
When you do something like addition, one way (albeit inefficient) to approach it is to do this:
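Something along these lines (a sketch with uint32_t limbs; the names limb_a and limb_b match the prose below):

```cpp
#include <cstdint>

// Add two limbs; use a double-sized temporary so the carry isn't lost.
uint32_t add_limbs(uint32_t limb_a, uint32_t limb_b, uint32_t& carry_out) {
    uint64_t sum = static_cast<uint64_t>(limb_a) + limb_b;  // the double-sized limb
    carry_out = static_cast<uint32_t>(sum >> 32);           // 1 if the 32-bit add overflowed
    return static_cast<uint32_t>(sum);                      // low 32 bits of the result
}
```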
This double-sized limb catches the overflow of limb_a + limb_b in the case that the sum is bigger than the limb size. So if we're using uint32_t as our limb size and the total is bigger than 2^32, the overflow can be caught.
Why do we need this? Well, what you typically do is loop through all the limbs - you've done this yourself in dividing your integer up and going through each one - but we do it LSL first (so the smallest limb first) just as you'd do arithmetic by hand.
This might seem inefficient, but this is just the C way of doing things. To really break out the big guns, x86 has adc as an instruction - add with carry. What this does is an arithmetic add on your fields and sets the carry bit if the arithmetic overflows the size of the register. The next time you do an add or adc, the processor factors in the carry bit too. In subtraction it's called the borrow flag.

This also applies to shift operations. As such, this feature of the processor is crucial to what makes bignums fast. So the fact is, there's electronic circuitry in the chip for doing this stuff - doing it in software is always going to be slower.
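For what it's worth, the same carry chain is reachable from C/C++ without hand-written assembly on compilers that expose the _addcarry_u64 intrinsic (GCC, Clang and MSVC on x86-64 do). A rough sketch of a limb-by-limb add built on it:

```cpp
#include <immintrin.h>
#include <cstddef>

// out = a + b over n 64-bit limbs, least-significant limb first.
// The hardware carry flag is threaded through the loop via the intrinsic.
unsigned char bignum_add(const unsigned long long* a, const unsigned long long* b,
                         unsigned long long* out, std::size_t n) {
    unsigned char carry = 0;
    for (std::size_t i = 0; i < n; ++i)
        carry = _addcarry_u64(carry, a[i], b[i], &out[i]);
    return carry;   // final carry-out
}
```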
Without going into too much detail, operations are built up from this ability to add, shift, subtract etc. They're crucial. Oh and you use the full width of your processor's register per limb if you're doing it right.
Second point - conversion between bases. You cannot take a value in the middle of a number and change its base, because you can't account for the overflow from the digit beneath it in your original base, and that digit can't account for the overflow from the digit beneath it... and so on. In short, every time you want to change base, you need to convert the entire bignum from the original base to the new base, and then back again. So you have to walk the bignum (all the limbs) three times at least. Or, alternatively, detect overflows expensively in all other operations... and remember, now you need to do modulo operations to work out if you overflowed, whereas before the processor was doing it for us.
I should also like to add that whilst what you've got is probably quick for this case, bear in mind that as a bignum library gmp does a fair bit of work for you, like memory management. If you're using mpz_ you're using an abstraction above what I've described here, for starters. Finally, gmp uses hand-optimised assembly with unrolled loops for just about every platform you've ever heard of, plus more. There's a very good reason it ships with Mathematica, Maple et al.

Now, just for reference, some reading material.
To sum it up for you: division assembly instructions suck, so people generally compute inverses and multiply instead, as you do when defining division in modular arithmetic. The various techniques that exist (see MCA) are mostly O(n).
Edit: Ok, not all of the techniques are O(n). Most of the techniques called div1 (dividing by something not bigger than a limb) are O(n). When you go bigger you end up with O(n^2) complexity; this is hard to avoid.
Now, could you implement bigints as an array of digits? Well yes, of course you could. However, consider the idea just under addition:
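A sketch of what that addition looks like with base-10 digit arrays (x, y and z are illustrative names):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Schoolbook addition on base-10 digit arrays (least-significant digit first).
// Every digit needs an explicit modulo and divide, where adc gives this for free.
std::vector<int> add_digits(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> z(std::max(x.size(), y.size()) + 1, 0);
    int carry = 0;
    for (std::size_t i = 0; i + 1 < z.size(); ++i) {
        int t = (i < x.size() ? x[i] : 0) + (i < y.size() ? y[i] : 0) + carry;
        z[i] = t % 10;      // explicit modulo per digit
        carry = t / 10;     // explicit divide for the carry
    }
    z.back() = carry;
    return z;
}
```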
That's a rough sketch of what you're looking at for addition via your scheme. So you have to run the conversion between bases: you're going to need a conversion into your representation for that base, then back again when you're done, because this form is just really slow everywhere else. We're not talking about the difference between O(n) and O(n^2) here, but we are talking about an expensive division instruction per limb or an expensive conversion every time you want to divide. See this.
Next up, how do you expand your division for general case division? By that, I mean when you want to divide those two numbers x and y from the above code. You can't, is the answer, without resorting to bignum-based facilities, which are expensive. See Knuth. Taking the modulo with respect to a number greater than your field size doesn't work.
Let me explain. Try 21979182173 mod 1099. Let's assume here for simplicity's sake that the biggest size field we can have is three digits. This is a contrived example, but the biggest field size I know of is 128 bits, using gcc extensions. Anyway, the point is, you:
Split your number into limbs. Then you take modulo and sum:
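For example, with 3-digit fields:

21 | 979 | 182 | 173

and running sums taken mod 1099: 21, then 21 + 979 = 1000 - already four digits, and mod 1099 doesn't bring it back under 1000, so it no longer fits in a 3-digit field. (And the true value, 21979182173 mod 1099 = 928, isn't sitting in any of those fields anyway.)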
It doesn't work. This is where Avi is correct, because this is a form of casting out nines, or an adaption thereof, but it doesn't work here because our fields have overflowed for a start - you're using the modulo to ensure each field stays within its limb/field size.
So what's the solution? Split your number up into a series of appropriately sized bignums? And start using bignum functions to calculate everything you need to? This is going to be much slower than any existing way of manipulating the fields directly.
Now perhaps you're only proposing this case for dividing by a limb, not a bignum, in which case it can work, but Hensel division and precomputed inverses etc. do so without the conversion requirement. I have no idea if this algorithm would be faster than, say, Hensel division; it would be an interesting comparison; the problem comes with a common representation across the bignum library. The representation chosen in existing bignum libraries is there for the reasons I've expanded on - it makes sense at the assembly level, where it was first done.
As a side note, you don't have to use uint32_t to represent your limbs. Ideally you use a size equal to the register size of the system (say uint64_t) so that you can take advantage of assembly-optimised versions. So on a 64-bit system adc rax, rbx only sets the carry flag (CF) if the result overspills 2^64.

tl;dr version: the problem isn't your algorithm or idea; it's the problem of converting between bases, since the representation you need for your algorithm isn't the most efficient way to do it in add/sub/mul etc. To paraphrase Knuth: this shows you the difference between mathematical elegance and computational efficiency.
If you need to frequently divide by the same divisor, using it (or a power of it) as your base makes division as cheap as bit-shifting is for base 2 binary integers.
You could use base 999 if you want; there's nothing special about using a power-of-10 base except that it makes conversion to decimal integer very cheap. (You can work one limb at a time instead of having to do a full division over the whole integer. It's like the difference between converting a binary integer to decimal vs. turning every 4 bits into a hex digit. Binary -> hex can start with the most significant bits, but converting to non-power-of-2 bases has to be LSB-first using division.)
For example, to compute the first 1000 decimal digits of Fibonacci(10^9) for a code-golf question with a performance requirement, my 105 bytes of x86 machine code answer used the same algorithm as this Python answer: the usual a+=b; b+=a Fibonacci iteration, but divide by (a power of) 10 every time a gets too large.

Fibonacci grows faster than carry propagates, so discarding the low decimal digits occasionally doesn't change the high digits long-term. (You keep a few extra digits beyond the precision you want.)
Dividing by a power of 2 doesn't work, unless you keep track of how many powers of 2 you've discarded, because the eventual binary -> decimal conversion at the end would depend on that.
So for this algorithm, you have to do extended-precision addition, and division by 10 (or whatever power of 10 you want).
I stored base-10^9 limbs in 32-bit integer elements. Dividing by 10^9 is trivially cheap: just a pointer increment to skip the low limb. Instead of actually doing a memmove, I just offset the pointer used by the next add iteration.

I think division by a power of 10 other than 10^9 would be somewhat cheap, but would require an actual division on each limb, and propagating the remainder to the next limb.
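In C-ish terms the "skip the low limb" idea is roughly this (a sketch, not the actual machine-code version):

```cpp
#include <cstddef>
#include <cstdint>

// A bignum viewed as base-10^9 limbs, least-significant limb first.
struct Limbs {
    const uint32_t* p;
    std::size_t n;
};

// Dividing by 10^9 (discarding the remainder) just skips the lowest limb:
// offset the pointer instead of doing a memmove.
Limbs div_pow9(Limbs v) {
    return { v.p + 1, v.n - 1 };
}
```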
Extended-precision addition is somewhat more expensive this way than with binary limbs, because I have to generate the carry-out manually with a compare: sum[i] = a[i] + b[i]; carry = sum < a; (unsigned comparison). And also manually wrap to 10^9 based on that compare, with a conditional-move instruction. But I was able to use that carry-out as an input to adc (the x86 add-with-carry instruction).

You don't need a full modulo to handle the wrapping on addition, because you know you've wrapped at most once.
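In portable C-ish terms, the manual-wrap idea for base-10^9 limbs looks something like this sketch (the real version is the cmp/cmov/adc sequence described above):

```cpp
#include <cstddef>
#include <cstdint>

// Add two base-10^9 bignums stored as uint32_t limbs, least-significant first.
// Each limb holds a value below 10^9, so a[i] + b[i] + carry always fits in 32 bits
// and wraps past the base at most once - no full modulo needed.
void add_base1e9(const uint32_t* a, const uint32_t* b, uint32_t* sum, std::size_t n) {
    const uint32_t BASE = 1000000000;          // 10^9
    uint32_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        uint32_t s = a[i] + b[i] + carry;      // at most 2*(10^9 - 1) + 1 < 2^32
        carry = (s >= BASE);                   // manual carry-out via a compare
        sum[i] = carry ? s - BASE : s;         // manual wrap (a cmov, with luck)
    }
}
```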
This wastes just over 2 bits of each 32-bit limb: 10^9 instead of 2^32 = 4.29... * 10^9. Storing base-10 digits one per byte would be significantly less space efficient, and very much worse for performance, because an 8-bit binary addition costs the same as a 64-bit binary addition on a modern 64-bit CPU.

I was aiming for code size: for pure performance I would have used 64-bit limbs holding base-10^19 "digits". (2^64 = 1.84... * 10^19, so this wastes less than 1 bit per 64.) This lets you get twice as much work done with each hardware add instruction. Hmm, actually this might be a problem: the sum of two limbs might wrap the 64-bit integer, so just checking for > 10^19 isn't sufficient anymore. You could work in base 5*10^18, or in base 10^18, or do more complicated carry-out detection that checks for binary carry as well as manual carry.

Storing packed BCD with one digit per 4-bit nibble would be even worse for performance, because there isn't hardware support for blocking carry from one nibble to the next within a byte.
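That "binary carry as well as manual carry" check could be done along these lines (again just a sketch of the idea):

```cpp
#include <cstdint>

// One step of base-10^19 limb addition in uint64_t, where a + b + carry_in can
// itself wrap past 2^64, so we need both the binary carry and the base-10^19 carry.
uint64_t add_limb_1e19(uint64_t a, uint64_t b, bool carry_in, bool& carry_out) {
    const uint64_t BASE = 10000000000000000000ull;   // 10^19
    uint64_t s = a + b + (carry_in ? 1 : 0);         // may wrap modulo 2^64
    bool binary_carry = s < a;                       // the 64-bit add itself wrapped
    carry_out = binary_carry || s >= BASE;
    return carry_out ? s - BASE : s;                 // correct in both cases, since the
                                                     // true sum is below 2 * 10^19
}
```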
Overall, my version ran about 10x faster than the Python extended-precision version on the same hardware (70 or 80 seconds vs. 12 minutes), though it still had room for significant speed optimization by dividing less often.
Still, I think for this particular implementation of that algorithm (where I only needed addition and division, and division happened after every few additions), the choice of base-10^9 limbs was very good. There are much more efficient algorithms for the Nth Fibonacci number that don't need to do 1 billion extended-precision additions.