Parallel computation for an infinite series
I have a quick question about how to speed up the calculation of infinite series.
This is just one example:
arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + ....
Let's say you have a library that lets you work with big numbers. The first obvious solution is to add or subtract each term of the series in turn until you reach some target N.
You can also cache x^n, so that each next term is computed as lastX*(x^2) instead of recomputing x^(n+2) from scratch.
But overall it seems to be a very sequential task. What can you do to utilize multiple processors (8+)?
Thanks a lot!
EDIT:
I will need to calculate somewhere between 100k and 1m iterations. This is a C++ application, but I am looking for an abstract solution, so it shouldn't matter.
Thanks for the reply.
Comments (3)
You need to break the problem down to match the number of processors or threads you have. In your case you could, for example, have one processor working on the even terms and another working on the odd terms. Instead of precalculating x^2 and using lastX*(x^2), each processor uses lastX*(x^4) to skip every other term. To use 8 processors, start each one on a different term and multiply its previous term by x^16 to step ahead 8 terms at a time.
P.S. Most of the time when presented with a problem like this, it's worthwhile to look for a more efficient way of calculating the result. Better algorithms beat more horsepower most of the time.
If you're trying to calculate the value of pi to millions of places or something similar, you first want to pay close attention to choosing a series that converges quickly and that is amenable to parallelization. Then, if you have enough digits, it will eventually become cost-effective to split them across multiple processors; you will have to find or write a bignum library that can do this.
Note that you can factor out the variables in various ways; e.g.:
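As an illustration, assuming the arctan series from the question (this is a reconstruction, not necessarily the exact forms originally posted), the plain series and its nested, Horner-style factoring are:

```latex
\begin{aligned}
\arctan(x) &= x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots \\
           &= x\left(1 - x^2\left(\frac{1}{3} - x^2\left(\frac{1}{5} - x^2\left(\frac{1}{7} - \cdots\right)\right)\right)\right)
\end{aligned}
```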
Although the second line is more efficient than a naive implementation of the first line, the latter calculation still has a linear chain of dependencies from beginning to end. You can improve your parallelism by combining terms in pairs:
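One way to write such a pairing, again only as a sketch that assumes the arctan series from the question:

```latex
\arctan(x) = x\left(1 - \frac{x^2}{3}\right)
           + x^5\left(\frac{1}{5} - \frac{x^2}{7}\right)
           + x^9\left(\frac{1}{9} - \frac{x^2}{11}\right) + \cdots
           = \sum_{k \ge 0} x^{4k+1}\left(\frac{1}{4k+1} - \frac{x^2}{4k+3}\right)
```

Each bracket depends only on x^2, and consecutive pairs differ by a factor of x^4, so the brackets can be evaluated independently of one another.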
However, this speedup is not as simple as you might think, since the time taken by each computation depends on the precision needed to hold its result. You need to take this into account when designing your algorithm. Your algebra is also intimately involved: for the above case, you'll get infinitely repeating fractions if you do regular divisions by your constant numbers, so you need to find some way to deal with that.
Well, for this example, you might sum the series (if I've got the brackets in the right places):
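A form consistent with the series in the question, with i starting at 1 and N an assumed cutoff on the number of terms (a reconstruction, not necessarily the expression originally posted), would be:

```latex
\arctan(x) \approx \sum_{i=1}^{N} (-1)^{i+1}\,\frac{x^{2i-1}}{2i-1}
```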
Then on processor 1 of 8 compute the sum of the terms for i = 1, 9, 17, 25, ...
Then on processor 2 of 8 compute the sum of the terms for i = 2, 10, 18, 26, ...
and so on, finally adding up the partial sums.
Or, you could do as you (nearly) suggest, give i = 1..16 (say) to processor 1, i = 17..32 to processor 2 and so on, and they can compute each successive power of x from the previous one. If you want more than 8x16 elements in the series, then assign more to each processor in the first place.
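The block-wise split just described might look like the following C++ sketch. It rests on assumptions that are not in the answer: `double` stands in for a big-number type, terms are numbered from k = 0 instead of i = 1, and `block_partial_sum` and `arctan_blocks` are hypothetical names. Each worker computes the first power of x in its block directly with `std::pow`, then derives every later power from the previous one by multiplying by x^2.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <future>
#include <vector>

// Sum of the contiguous terms k = begin .. end-1, where term k of the
// arctan series is (-1)^k * x^(2k+1) / (2k+1).
double block_partial_sum(double x, std::size_t begin, std::size_t end) {
    const double x2 = x * x;
    // First power in this block, computed directly.
    double power = std::pow(x, static_cast<double>(2 * begin + 1));
    double sum = 0.0;
    for (std::size_t k = begin; k < end; ++k) {
        const double sign = (k % 2 == 0) ? 1.0 : -1.0;
        sum += sign * power / static_cast<double>(2 * k + 1);
        power *= x2;  // next odd power from the previous one
    }
    return sum;
}

// Split n_terms into at most n_workers contiguous blocks and sum them in parallel.
double arctan_blocks(double x, std::size_t n_terms, std::size_t n_workers) {
    n_workers = std::max<std::size_t>(n_workers, 1);
    const std::size_t block = (n_terms + n_workers - 1) / n_workers;  // ceiling division

    std::vector<std::future<double>> parts;
    for (std::size_t begin = 0; begin < n_terms; begin += block) {
        const std::size_t end = std::min(begin + block, n_terms);
        parts.push_back(std::async(std::launch::async, block_partial_sum,
                                   x, begin, end));
    }
    double total = 0.0;
    for (auto& part : parts) total += part.get();
    return total;
}
```

For instance, `arctan_blocks(0.5, 128, 8)` gives 16 terms to each of 8 tasks, matching the 8x16 split mentioned above. Compared with striding, each worker pays once for a large starting power but then only ever multiplies by x^2.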
I doubt whether, for this example, it is worth parallelising at all. I suspect that you will reach double-precision accuracy on 1 processor while the parallel threads are still waking up. But that's just a guess for this example, and you can probably find many series for which parallelisation is worth the effort.
And, as @Mark Ransom has already said, a better algorithm ought to beat brute-force and a lot of processors every time.