C#:如何使阿特金筛增量
我不知道这是否可能,但我只是想问一下。我的数学和算法技能在这里有点让我失败:P
问题是我现在有这个类可以生成达到一定限制的素数:
public class Atkin : IEnumerable<ulong>
{
private readonly List<ulong> primes;
private readonly ulong limit;
public Atkin(ulong limit)
{
this.limit = limit;
primes = new List<ulong>();
}
private void FindPrimes()
{
var isPrime = new bool[limit + 1];
var sqrt = Math.Sqrt(limit);
for (ulong x = 1; x <= sqrt; x++)
for (ulong y = 1; y <= sqrt; y++)
{
var n = 4*x*x + y*y;
if (n <= limit && (n % 12 == 1 || n % 12 == 5))
isPrime[n] ^= true;
n = 3*x*x + y*y;
if (n <= limit && n % 12 == 7)
isPrime[n] ^= true;
n = 3*x*x - y*y;
if (x > y && n <= limit && n % 12 == 11)
isPrime[n] ^= true;
}
for (ulong n = 5; n <= sqrt; n++)
if (isPrime[n])
{
var s = n * n;
for (ulong k = s; k <= limit; k += s)
isPrime[k] = false;
}
primes.Add(2);
primes.Add(3);
for (ulong n = 5; n <= limit; n++)
if (isPrime[n])
primes.Add(n);
}
public IEnumerator<ulong> GetEnumerator()
{
if (!primes.Any())
FindPrimes();
foreach (var p in primes)
yield return p;
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
现在,我想要的是摆脱限制,以便序列永远不会停止(理论上)。
我想事情可能是这样的:
- 从一些琐碎的限制开始
- 查找达到极限的所有素数
- 产生所有新发现的素数
- 增加限制(通过将旧限制加倍或平方或类似的方式)
- 转到第 2 步
最佳情况下,第二步应该只在旧极限之间起作用和新的。换句话说,它不应该一次又一次地找到最低的素数。
有办法做到这一点吗?我的主要问题是我不太明白该算法中的 x 和 y 是什么。就像,我可以使用相同的算法类型,但将 x
和 y
设置为 oldLimit
(最初为 1)并运行到 新限制?或者说这会如何运作?有聪明的头脑对此有一些启发吗?
这样做的目的是让我不必设置该限制。这样我就可以使用 Linq 并仅使用 Take()
无论我需要多少素数,而不用担心限制是否足够高等等。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
下面是 C# 中无界素数筛选的解决方案,可以使用埃拉托斯特尼筛法 (SoE) 或阿特金筛法 (SoA) 算法实现;然而,我认为,与真正的 SoE 提供大约相同的性能而没有那么多复杂性相比,优化 SoA 解决方案的极端复杂性几乎不值得。因此,这可能只是部分答案,因为虽然它展示了如何实现更好的 SoA 算法并展示了如何使用 SoE 实现无界序列,但它仅暗示如何组合这些来编写相当高效的 SoA。
请注意,如果只需要讨论这些想法的最快实现,请跳到此答案的底部。
首先,我们应该评论此练习的要点,即生成无界素数序列以允许使用 IEnumerable 扩展方法,例如 Take()、TakeWhile()、Where()、Count() 等,因为这些方法由于增加了方法调用级别而牺牲了一些性能,每次调用至少增加 28 个机器周期并返回枚举一个值并为每个函数添加几级方法调用;也就是说,即使对枚举结果使用更多命令式过滤技术以获得更快的速度,拥有(有效)无限序列仍然有用。
请注意,在下面的更简单的示例中,我将范围限制为无符号 32 位数字 (uint),超过该范围基本 SoE 或 SoA 实现在效率方面并不真正合适,需要进行切换到“桶”或其他形式的增量筛以进行部分筛分以提高效率; 但是,对于完全优化的页分段筛选(如最快的实现),范围会增加到 64 位范围,尽管如所写,人们可能不希望筛选超过大约一百万亿(十的十四次方)因为整个范围的运行时间将需要长达数百年的时间。
由于问题选择 SoA 的原因可能是错误的,首先将 Trial Division (TD) 素数筛误认为是真正的 SoE,然后使用天真的 SoE在 TD 筛上的 SoA,让我们建立真正的有界 SoE,它可以作为一种方法在几行中实现(可以根据问题的实现编码风格转换为类),如下所示:
这将计算 200 万的素数在 i7-2700K (3.5 GHz) 上大约需要 16 毫秒,在同一台机器上大约 67 秒内将 203,280,221 个素数计算到 32 位数字范围内的 4,294,967,291(假设有 256 MB 的备用 RAM)。
现在,使用上面的版本与 SoA 进行比较并不公平,因为真正的 SoA 会自动忽略 2、3 和 5 的倍数,因此引入轮分解对 SoE 执行相同操作会产生以下代码:
上面的代码计算在与上述相同的机器上,在大约 10 毫秒内将质数变为 200 万,在大约 40 秒内将质数变为 32 位数字范围。
接下来,让我们确定我们可能在此处编写的 SoA 版本与上述代码中的 SoE 相比,在执行速度方面是否确实具有任何优势。下面的代码遵循上述 SoE 模型,并优化了 维基百科文章 至于按照该文章中的建议,对各个二次解使用单独的循环来调整“x”和“y”变量的范围,仅针对奇数解求解二次方程(以及平方自由消除),结合“3 *x^2" 二次方程可在同一遍中求解正第二项和负第二项,并消除计算量大的模运算,生成的代码比此处发布的伪代码的简单版本快三倍以上,如此处问题中所使用的。有界 SoA 的代码如下:
由于操作数量未完全优化,这仍然比发布的轮分解 SoE 算法慢两倍以上。可以对 SoA 代码进行进一步优化,如对原始(非伪代码)算法使用模 60 运算,并使用位打包仅处理除 3 和 5 之外的倍数,以使代码的性能与 SoE 相当接近甚至稍微超过它,但我们越来越接近 Berstein 实现实现此性能的复杂性。我的观点是“当人们非常努力地工作以获得与 SoE 相同的性能时,为什么选择 SoA?”。
现在,对于无界素数序列,最简单的方法是定义 Uint32.MaxValue 的 const top_number 并消除 primesSoE 或 primesSoA 方法中的参数。这有点低效,因为无论正在处理的实际素数值如何,剔除仍然在整个数字范围内完成,这使得确定小范围素数花费的时间比应有的时间长得多。除了这种情况和极端的内存使用之外,还有其他原因需要使用素数筛的分段版本:首先,使用主要位于 CPU L1 或 L2 数据缓存中的数据的算法由于内存访问效率更高而处理速度更快,并且其次,因为分段允许将作业轻松分割为多个页面,这些页面可以使用多核处理器在后台进行剔除,从而实现与所使用的核心数量成正比的加速。
为了提高效率,我们应该选择 CPU L1 或 L2 缓存大小的数组大小,对于大多数现代 CPU 而言,该数组大小至少为 16 KB,以避免在我们从数组中剔除素数组合时发生缓存抖动,这意味着 BitArray 可以具有大小是其大小的八倍(每字节 8 位)或 128 KB。由于我们只需要筛选奇数素数,这代表了超过 25 万的数字范围,这意味着分段版本将仅使用 8 个分段来筛选 200 万个,如欧拉问题 10 的要求。人们可以将结果保存在最后一段并从该点继续,但这将妨碍将此代码适应多核情况,在多核情况下,将剔除到多线程的后台以充分利用多核处理器。 (单线程)无界 SoE 的 C# 代码如下:
上面的代码大约需要 16 毫秒来筛选最多 200 万素数,大约需要 30 秒来筛选完整的 32 位数字范围。此代码比使用阶乘轮而不对大数字范围进行分段的类似版本更快,因为即使代码在计算方面更复杂,但由于不破坏 CPU 缓存,因此节省了大量时间。大部分复杂性在于计算每个段每个基素数的新起始索引,这可以通过保存每个段每个素数的状态来减少,但是上面的代码可以很容易地转换,以便在分离后台线程,使四核机器的速度进一步提高四倍,八核机器的速度甚至更快。不使用 BitArray 类并通过位掩码寻址各个位位置将提供大约两倍的进一步加速,并且阶乘轮的进一步增加将提供约三分之二的时间的另一次减少。对于通过轮分解消除的值,更好地将位数组打包在消除索引中会稍微提高大范围的效率,但也会在一定程度上增加位操作的复杂性。然而,所有这些 SoE 优化都不会达到 Berstein SoA 实现的极端复杂性,但运行速度大致相同。
要将上述代码从 SoE 转换为 SoA,我们只需要从有界情况更改为 SoA 剔除代码,但修改为每个页段重新计算起始地址,就像为 SoE 情况计算起始地址一样,但运算起来更加复杂,因为二次方程是通过数字的平方而不是简单的素数倍数来推进的。我将把所需的调整留给学生,尽管我并没有真正明白这一点,因为经过合理优化的 SoA 并不比当前版本的 SoE 更快,而且要复杂得多;)
EDIT_ADD:< /strong>
注意:下面的代码已被更正,虽然原始发布的代码对于提供的素数序列是正确的,但它比需要的速度慢了一半,因为它剔除了平方根以下的所有数字 最有效和最简单的优化是将每个段页面的剔除操作委托给后台线程,以便在有足够的 CPU 核心的
情况下,主要枚举素数的限制是枚举循环本身,在这种情况下,在上述参考机(C# 中)上,可以在大约十秒内枚举出 32 位数字范围内的所有素数,而无需其他优化,所有其他优化包括编写IEnumerable 接口实现而不是使用内置迭代器,将 32 位数字范围内的所有 203,280,221 个素数的时间减少到大约六秒(1.65 秒到十亿),同样对于大多数仅仅枚举结果所花费的时间。以下代码进行了许多修改,包括使用任务使用的 Dotnet Framework 4 线程池中的后台任务、使用作为查找表实现的状态机来实现进一步的位打包以使素数枚举更快,以及重写算法作为使用“自己滚动”迭代器来提高效率的可枚举类:
请注意,对于小范围的素数(最多一两百万),此代码并不比上一个版本更快,因为设置和初始化数组,但对于高达 40 亿以上的更大范围,速度要快得多。它比阿特金筛问题的实施快大约 8 倍,但对于高达 40 亿以上的更大范围,速度会快 20 到 50 倍。代码中定义了常量,用于设置基本缓存大小以及每个线程要剔除的缓存数量 (CHNKSZ),这可能会稍微调整性能。可以尝试一些进一步的优化,这可能会将大素数的执行时间减少多达两倍,例如针对 2,3,5,7 轮进行进一步的位打包,而不仅仅是 2,3,5 轮打包工作减少了约 25%(也许将执行时间减少到三分之二),并且通过轮阶乘预先剔除页面段高达 17 倍,进一步减少了约 20%,但这些只是程度与可以利用独特的本机代码优化的 C 代码相比,纯 C# 代码可以完成哪些工作。
无论如何,如果您打算按照问题的要求使用 IEnumberable 接口进行输出,则几乎不值得进一步优化,因为大约三分之二的时间用于枚举找到的素数,而只有大约三分之一的时间用于剔除合数。更好的方法是在类上编写方法来实现所需的结果,例如 CountTo、ElementAt、SumTo 等,从而完全避免枚举。
但我确实做了额外的优化作为附加答案,尽管复杂性增加,但它显示了额外的好处,并且进一步说明了为什么人们不想使用 SoA,因为完全优化的 SoE 实际上更好。
Here is a solution to unbounded prime sieving in C#, which can be implemented using the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithms; however, I maintain that it is hardly worth the extreme complexity of an optimized SoA solution given than the true SoE gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these to write a reasonably efficient SoA.
Note that if only discussion on the very fastest implementations of these ideas is desired, jump to the bottom of this answer.
First we should comment on the point of this exercise in producing an unbounded sequence of primes to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), etcetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) infinite sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.
Note, in the the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren't really appropriate as to efficiency and one needs to switch over to a "bucket" or other form of incremental sieve for part of the sieving for efficiency; however, for a fully optimized page segmented sieve as in the fastest implementation, the range is increased to the 64-bit range although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time would take up to hundreds of years for the full range.
As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) prime sieve for a true SoE and then using a naive SoA over the TD sieve, let's establish the true bounded SoE which can be implemented in a few lines as a method (which could be converted to a class as per the question's implementation coding style) as follows:
This calculates the primes to 2 million in about 16 milliseconds on an i7-2700K (3.5 GHz) and the 203,280,221 primes up to 4,294,967,291 in the 32-bit number range in about 67 seconds on the same machine (given a spare 256 MegaBytes of RAM memory).
Now, using the version above to compare to the SoA is hardly fair as the true SoA automatically ignores the multiples of 2, 3, and 5 so introducing wheel factorization to do the same for the SoE yields the following code:
The above code calculates the primes to 2 million in about 10 milliseconds and the primes to the 32-bit number range in about 40 seconds on the same machine as above.
Next, let's establish whether a version of the SoA that we are likely to write here actually has any benefit as compared to the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article as to tuning the ranges of the 'x' and 'y' variables using separate loops for the individual quadratic solutions as suggested in that article, solving the quadratic equations (and the square free eliminations) only for odd solutions, combining the "3*x^2" quadratic to solve for both the positive and negative second terms in the same pass, and eliminating the computationally expensive modulo operations, to produce code that is over three times faster than the naive version of the pseudo-code posted there and as used in the question here. The code for the bounded SoA is as then follows:
This is still over twice as slow as the wheel factorization SoE algorithm posted due to the not fully optimized number of operations. Further optimizations can be made to the SoA code as in using modulo 60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5 to get the code fairly close in performance to SoE or even exceed it slightly, but we get closer and closer to the complexity of the Berstein implementation to achieve this performance. My point is "Why SoA, when one works very hard to get about the same performance as SoE?".
Now for the unbounded primes sequence, the very easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument in the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination for small ranges of primes take much longer than it should. There are also other reasons to go to a segmented version of the primes sieve other than this and extreme memory use: First, algorithms that use data that is primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and secondly because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors for a speed-up that can be proportional to the number of cores used.
For efficiency, we should choose an array size of the CPU L1 or L2 cache size which is at least 16 Kilobytes for most modern CPU's in order to avoid cache thrashing as we cull composites of primes from the array, meaning that the BitArray can have a size of eight times that large (8 bits per byte) or 128 Kilobits. Since we only need to sieve odd primes, this represents a range of numbers of over a quarter million, meaning that a segmented version will only use eight segments to sieve to 2 million as required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case where one relegates culling to the background on multiple threads to take full advantage of multi-core processors. The C# code for an (single thread) unbounded SoE is as follows:
The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds to sieve the full 32-bit number range. This code is faster than the similar version using the factorial wheel without segmentation for large number ranges because, even though the code is more complex as to computations, a great deal of time is saved in not thrashing the CPU caches. Much of the complexity is in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per prime per segment but the above code can easily be converted so as to run the culling processes on separate background threads for a further four times speed up for a machine with four cores, even more with eight cores. Not using the BitArray class and addressing the individual bit locations by bit masks would provide a further speed up by about a factor of two, and further increases of the factorial wheel would provide about another reduction in time to about two thirds. Better packing of the bit array in eliminated indexes for values eliminated by the wheel factorization would slightly increase efficiency for large ranges, but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would not approach the extreme complexity of the Berstein SoA implementation, yet would run at about the same speed.
To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case but with the modification that the starting addresses are recalculated for every page segment something like as the starting address is calculated for the SoE case, but even somewhat more complex in operation as the quadratics advance by squares of numbers rather than by simple multiples of primes. I'll leave the required adaptations to the student, although I don't really see the point given that the SoA with reasonable optimizations is not faster than the current version of the SoE and is considerably more complex ;)
EDIT_ADD:
Note: The below code has been corrected, as while the original posted code was correct as to the provided primes sequence, it was half again slower than it needed to be as it culled for all numbers below the square root of the range rather than only the found base primes up to the square root of the range.
The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself, in which case all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations, with all other optimizations including writing the IEnumerable interface implementations rather than using the built-in iterators reducing this to about six seconds for all the 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), again with most of the time spent just enumerating the results. The following code has many of those modifications including using background tasks from the Dotnet Framework 4 ThreadPool used by Tasks, using a state machine implemented as a look up table to implement further bit packing to make the primes enumeration quicker, and re-writing the algorithm as an enumerable class using "roll-your-own" iterators for increased efficiency:
Note that this code isn't faster than the last version for small ranges of primes as in up to one or two million due to the overhead of setting up and initializing arrays but is considerably faster for larger ranges up to four billion plus. It is about 8 times faster than the question's implementation of the Atkin sieve, but goes to from 20 to 50 times as fast for larger ranges up to four billion plus. There are defined constants in the code setting the base cache size and how many of these to cull per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two such as further bit packing as for the 2,3,5,7 wheel rather than just the 2,3,5 wheel for a reduction of about 25% in packing (perhaps cutting the execution time to two thirds) and pre-culling the page segments by the wheel factorial up to the factor of 17 for a further reduction of about 20%, but these are about the extent of what can be done in pure C# code as compared to C code which can take advantage of unique native code optimizations.
At any rate, it is hardly worth further optimizations if one intends to use the IEnumberable interface for output as the question desires as about two thirds of the time is used just enumerating the found primes and only about one third of the time is spent in culling the composite numbers. A better approach would be to write methods on the class to implement the desired results such as CountTo, ElementAt, SumTo, etcetera so as to avoid enumeration entirely.
But I did do the extra optimizations as an additional answer which shows additional benefits in spite of additional complexity, and which further shows why one doesn't want to use the SoA because a fully optimized SoE is actually better.
以下代码执行我之前的答案底部讨论的优化,并包含以下功能:
范围 18,446,744,073,709,551,615 并进行范围溢出检查
删除,因为人们不太可能想要运行该程序
处理所有种类的物质需要数百年的时间
数量达到该限制。处理成本非常低
时间,因为分页可以使用 32 位页范围完成,并且仅
最终质数输出需要计算为 64 位数字。
2、3、5、7 素因子轮,带有额外的复合材料预剔除
使用附加素数 11、13 和 17 的数字,
大大减少了冗余合数剔除(现在只
每个合数平均剔除约1.5次)。到期的
这样做的(DotNet 相关的)计算开销(也
适用于2、3、5轮如之前版本)实际时间
节省剔除并不是那么好,但枚举答案
由于许多“简单”合数,速度稍快一些
在打包位表示中被跳过。
用于基于每页的线程池中的多线程。
随着更多基本素数的增加,该类包含的数组不断增长
需要作为线程安全方法而不是固定的预先计算
先前使用的基本素数数组。
基本素数表示已减少到每个基本一个字节
主要用于进一步减少内存占用;因此,总共
除了代码之外的内存占用是保存该基数的数组
素数的表示,直到平方根
当前正在处理的范围,以及打包的位页缓冲区
当前设置为低于 256 KB 的二级缓存大小
(最小页面大小 14,586 字节乘以 CHNKSZ 17 作为
提供)每个 CPU 核心加上一个额外的前台缓冲区
要处理的任务。使用此代码,大约三兆字节
足以处理十到十四的素数范围
力量。以及由于允许高效的多处理而带来的速度,
这减少了内存需求是使用
分页筛实现。
上面的代码大约需要 59 毫秒来找到 200 万的素数(由于初始化开销,比其他一些更简单的代码稍慢),但计算出 10 亿的素数和完整的数字范围分别需要 1.55 和 5.95 秒。这并不比上一个版本快多少,因为与剔除合数所花费的时间相比,DotNet 在枚举找到的素数时进行额外的数组边界检查的额外开销还不到枚举所用时间的三分之一,因此剔除复合材料中的节省被枚举中的额外时间(由于每个素数候选者的额外数组边界检查)所抵消。然而,对于许多涉及素数的任务,不需要枚举所有素数,只需计算答案即可,无需枚举。
由于上述原因,该类提供了示例静态方法“CountTo”、“SumTo”和“ElementAt”,分别对素数进行计数或求和至给定上限或输出从零开始的第 n 个素数。 “CountTo”方法将分别在0.32秒和1.29秒内生成到10亿且32位数字范围内的素数个数 >; "ElementAt" 方法将分别在大约 0.32 和 1.25 秒内生成这些范围内的最后一个元素,"SumTo" 方法生成这些范围内所有素数的总和分别约为0.49 和 1.98 秒。该程序将所有素数的总和计算为 40 亿以上,所用的时间比许多简单的实现将所有素数求和为 200 万的时间要短,如欧拉问题 10,超过实际范围的 2000 倍!
此代码仅比 primesieve 使用的高度优化的 C 代码慢大约四倍,并且它较慢的原因主要是由于DotNet,如下(讨论256 KB缓冲区的情况,这是L2缓存的大小):
循环,这是私有静态“剔除”中的最后一个“for 循环”
方法”,每个循环仅包含四个语句加上范围
查看。
循环,包括两个数组边界的大约 5 个时钟周期
每个循环的检查。
8.33 个时钟周期的优势约为 2.67 倍。
将每个循环执行工作的平均时间减少到大约 4.17
每个复合剔除循环的时钟周期,额外增益为 2
倍,总增益约为5.3倍。
以及效率较低的即时 (JIT) 编译器生成的代码
而且 primesieve 使用的 OpemMP 多线程也没有
似乎与使用 Thread 一样能很好地解决这个问题
这里有线程池,所以最终的多线程增益大约只有
四次。
数组边界检查,并且已尝试过,但 JIT 编译器没有
优化指针以及普通的基于数组的代码,因此增益
没有数组绑定检查会被效率较低的方法取消
代码(每次指针访问(重新)从内存加载指针地址
而不是使用已经指向该地址的寄存器,如
优化的数组情况)。
可用 L1 缓存的大小(多线程时为 16 KB)
对于 i3/i5/i7 CPU),因为其更高效的代码具有更多
优点是将平均内存访问时间减少到一个时钟周期
从大约四个时钟周期开始,这个优势使得
与 DotNet 代码的区别,后者事半功倍
每减少页数的处理。因此,primsieve 大约是
当每个人都使用最有效的缓冲区大小时,速度提高五倍。
此 DotNet 代码将在大约一个半小时内(经过测试)计算 (CountTo) 10 的 13 次方(十万亿)的素数数量,并在超过primesieve 需要半天(估计),而 primesieve 则分别需要 20 分钟和不到 4 小时。这在历史上很有趣,因为直到 1985 年,我们只知道 10 到 13 的范围内的素数计数,因为在当时的(昂贵的)超级计算机上需要太多的执行时间才能找到十倍大的范围;现在我们可以在普通台式计算机(本例中为 Intel i7-2700K - 3.5 GHz)上轻松计算这些范围内的素数个数!
为什么 Atkin 和 Bernstein 教授认为 SoA 比 SoE 更快 - 这个神话一直持续到今天,推理如下:
切换和无平方复合数剔除(最后可以是
使用与固有使用相同的 2、3、5 轮优化进行优化
由SoA算法)来确定两者的总数
对于 32 位数字范围,这些操作大约为 14 亿次。
实现(两者都没有像此代码那样优化),
它使用与 SoA 相同的 2、3、5 轮优化,将具有
相同剔除循环总共约 18.2 亿次剔除操作
计算复杂度。
与他对 SoE 的实施相比,大约是正确的,只是
基于等效操作的数量。然而,他的
SoE 的实施并没有采取“车轮分解”
max”,因为 SoA 对车轮进一步转动的角度没有太大反应
分解为 2、3、5 轮“烘焙”到基本轮中
算法。
对于 32 位数字,复合剔除操作约为 12 亿次
范围;因此,该算法使用这个轮数
因式分解的运行速度比同等版本快约 16.7%
SoA 的,因为剔除循环可以以相同的方式实现
每个算法。
等效的 SoA,因为只需要一个状态表查找数组
剔除基本质数而不是额外的状态外观
四个二次方程解法案例中每一个的 up 数组
产生有效的状态切换。
将等效地响应 SoA 的 C 编译器优化
primesieve 中使用的 SoE。
primesieve 使用的展开”优化与
用于 SoE 实施。
使用上述 SoE 代码技术的算法,
当输出
素数已被枚举,但使用时会慢约 16.7%
静态直接方法。
要求基素数的表示和相同数量的
段页缓冲区。
EDIT_ADD:有趣的是,此代码在 x86 32 位模式下的运行速度比 x64 64 位模式下快 30%,这可能是由于避免了将 uint32 数字扩展为 ulong 的轻微额外开销。以上所有时序均针对 64 位模式。 END_EDIT_ADD
总而言之:阿特金分段分页筛实际上比埃拉托斯特尼最大优化分段分页筛慢,而且没有节省内存需求!!!
我再说一遍:<强>“为什么使用阿特金筛?”。
The following code does the optimizations as discussed at the bottom of my previous answer and includes the following features:
range of 18,446,744,073,709,551,615 with the range overflow checks
removed since it is unlikely that one would want to run the program
for the hundreds of years it would take to process the full range of
numbers to that limit. This is at very little cost in processing
time as the paging can be done using 32-bit page ranges and only the
final prime output needs to be computed as a 64-bit number.
2,3,5,7 prime factor wheel with an additional pre-cull of composite
numbers using the the additional primes of 11, 13, and 17, to
greatly reduce the redundant composite number culling (now only
culling each composite number an average of about 1.5 times). Due
to the (DotNet related) computational overheads of doing this (also
applies for the 2,3,5 wheel as the previous version) the actual time
saving in culling isn't all that great but enumerating the answers
is somewhat faster due to many of the "simple" composite numbers
being skipped in the packed bit representation.
for multi-threading from the thread pool on a per page basis.
growing the array contained by this class as more base primes are
required as a thread safe method instead of the fixed pre-computed
base primes array used previously.
The base primes representation has been reduced to one byte per base
prime for a further reduction in memory footprint; thus, the total
memory footprint other than the code is the array to hold this base
primes representation for the primes up to the square root of the
current range being processed, and the packed bit page buffers which
are currently set at under the L2 cache size of 256 Kilobytes
(smallest page size of 14,586 bytes times the CHNKSZ of 17 as
supplied) each per CPU core plus one extra buffer for the foreground
task to process. With this code, about three Megabytes is
sufficient to process the prime range up to ten to the fourteenth
power. As well as speed due to allowing efficient multiprocessing,
this reduce memory requirement is the other advantage of using a
paged sieve implementation.
The above code takes about 59 milliseconds to find the primes to two million (slightly slower than some of the other simpler codes due to initialization overhead), but calculates the primes to one billion and the full number range in 1.55 and 5.95 seconds, respectively. This isn't much faster than the last version due to the DotNet extra overhead of an extra array bound check in the enumeration of found primes compared to the time expended in culling composite numbers which is less than a third of the time spent emumerating, so the saving in culling composites is cancelled out by the extra time (due to an extra array bounds check per prime candidate) in the enumeration. However, for many tasks involving primes, one does not need to enumerate all primes but can just compute the answers without enumeration.
For the above reasons, this class provides the example static methods "CountTo", "SumTo", and "ElementAt" to count or sum the primes to a given upper limit or to output the zero-based nth prime, respectively. The "CountTo" method will produce the number of primes to one billion and in the 32-bit number range in about 0.32 and 1.29 seconds, respectively; the "ElementAt" method will produce the last element in these ranges in about 0.32 and 1.25 seconds, respectively, and the "SumTo" method produces the sum of all the primes in these ranges in about 0.49 and 1.98 seconds respectively. This program calculates the sum of all the prime numbers to four billion plus as here in less time than many naive implementations can sum all the prime numbers to two million as in Euler Problem 10, for over 2000 times the practical range!
This code is only about four times slower than very highly optimized C code used by primesieve, and the reasons it is slower are mostly due to DotNet, as follows (discussing the case of a 256 Kilobyte buffer, which is the size of the L2 cache):
loop, which is the last "for loop" in the private static "cull"
method" and only contains four statements per loop plus the range
check.
loop, including about 5 clock cycles for the two array bounds
checks per loop.
8.33 clock cycles for an advantage of about 2.67 times.
reduce the average time to perform the work per loop to about 4.17
clock cycles per composite cull loop, for a additional gain of two
times and a total gain of about 5.3 times.
well as the less efficient Just In Time (JIT) compiler produced code
and as well, the OpemMP multi-threading used by primesieve doesn't
appear to be as well adapted to this problem as use of the Thread
Pool threads here, so the final multi-threaded gain in only about
four times.
array bounds checks, and it was tried, but the JIT compiler does not
optimize pointers as well as normal array based code, so the gain of
not having array bound checks is cancelled by the less efficient
code (every pointer access (re)loads the pointer address from memory
rather than using a register already pointing to that address as in
the optimized array case).
size of the available L1 cache (16 Kilobytes when multi-threading
for the i3/i5/i7 CPU's) as its more efficient code has more of an
advantage reducing the average memory access time to one clock cycle
from about four clock cycles, which advantage makes much less of a
difference to the DotNet code, which gains more from less
processing per reduced number of pages. Thus, primesieve is about
five times faster when each use their most efficient buffer size.
This DotNet code will count (CountTo) the number of primes to ten to the power thirteen (ten trillion) in about an hour and a half (tested) and the number of primes to a hundred trillion (ten to the fourteenth) in just over a half day (estimated) as compared to twenty minutes and under four hours for primesieve, respectively. This is interesting historically as until 1985 only the count of primes in the range to ten to the thirteenth was known since it would take too much execution time on the (expensive) supercomputers of that day to find the range ten times larger; now we can easily compute the number of primes in those ranges on a common desktop computer (in this case, an Intel i7-2700K - 3.5 GHz)!
Using this code, it is easy to understand why Professor Atkin and Bernstein thought that the SoA was faster than the SoE - a myth that persists to this day, with the reasoning as follows:
toggles and square free composite number culls (which last can be
optimized using the same 2,3,5 wheel optimization as used inherently
by the SoA algorithm) to determine that the total number of both of
these operations is about 1.4 billion for the 32-bit number range.
implementation (neither of which are as optimized as this code),
which uses the same 2,3,5 wheel optimization as the SoA, will have a
total of about 1.82 billion cull operations with the same cull loop
computational complexity.
compared to his implementation of the SoE is about right, just
based on the number of equivalent operations. However, his
implementation of the SoE did not take wheel factorization "to the
max" since the SoA does not much respond to further degrees of wheel
factorization as the 2,3,5 wheel is "baked in" to the basic
algorithm.
composite cull operations to about 1.2 billion for the 32-bit number
range; therefore, this algorithm using this degree of wheel
factorization will run about 16.7% faster than an equivalent version
of SoA since the culling loop(s) can be implemented about the same for
each algorithm.
equivalent SoA as there needs only be one state table look up array
to cull across the base primes instead of an additional state look
up array's for each of the four quadratic equation solution cases
that produce valid state toggles.
will respond equivalently to C compiler optimizations for SoA as for
the SoE used in primesieve.
unrolling" optimization used by primesieve just as effectively as
for the SoE implementation.
algorithm using the techniques as for the above SoE code, the
resulting SoA would only be a slight amount slower when the output
primes were enumerated but would be about 16.7% slower when using
the static direct methods.
require the representation of the base primes and the same number of
segment page buffers.
EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulong's. All of the above timings are for 64-bit mode. END_EDIT_ADD
In summary: The segment paged Sieve of Atkin is actually slower than a maximally optimized segment paged Sieve of Eratosthenes with no saving in memory requirements!!!
I say again: "Why use the Sieve of Atkin?".
这是对提出的问题的更明确的回答,如下:
当然,阿特金筛法 (SoA) 可以实现为分段算法,实际上根本不需要使用建议的分段数组,而只需使用原始序列即可完成 我已经使用 F# 完成了功能,尽管这比使用 F# 所允许的额外效率要慢很多用于剔除的可变数组。
上述算法至少可以通过两种不同的方式实现: 1) 当序列“运行”当前段并开始时,为“x”的每个值保存“x”和“y”的状态再次使用下一段的这些值,或者 2) 计算适用于新段的“x”和“y”对值。虽然第一种方法更简单,但我推荐第二种方法有两个原因:1)它不使用内存来存储必须保存的 x 和 y 的所有(许多)值,并且只必须保存基本素数的表示2)它为使用多线程并为每个页段分配独立的线程操作打开了大门,以便在多处理器计算机上节省大量时间。
事实上,有必要更好地理解“x”和“y”:
有一个答案解决了这个问题,但可能还不够明确。将这些二次方程视为潜在的无限序列可能更容易,其中“x”或“y”之一从其最低值开始固定,另一个变量生成每个序列的所有元素。例如,可以将二次方程“4*x^2 + y^2”的奇数表达式视为以 5, 17, 37, 65, ... 开头的序列序列,并且每个序列都有元素如 {5, 13, 29, 53, ...}, {17, 25, 41, 65, ...}, {37, 45, 61, 85, ...}, {65, 73, 89, 125, ...}, ... 显然,其中一些元素不是素数,因为它们是 3 或 5 的复合,这就是为什么这些元素需要通过模测试或伯恩斯坦中的方法来消除在实现中,可以通过识别生成器的模算术中的模式来自动跳过这些,因此它们实际上永远不会出现。
实现第一种更简单的方法来生成 SoA 的分段版本,只需要保存每个序列序列的状态,这基本上就是 F# 增量实现中所做的事情(尽管使用折叠树结构来提高效率),这可以很容易地适应在数组页面范围内工作。对于在每个段页面的开头计算序列状态的情况,只需计算每个“活动”序列的新段页面中最低元素所表示的数量为止,有多少元素可以放入空间中,其中“active”表示起始元素低于段页起始索引所表示的数字的序列。
至于如何实现基于数组的 SoA 筛分的伪代码,我已经为 相关帖子,其中展示了如何实现这一点。
正如另一个答案中所述,您可以通过在代码中将最大“限制”设置为常量来实现此最终目标,但这对于小范围的素数来说效率相当低,因为剔除将在比该数组大得多的数组上进行必要的。如前所述,除了提高效率和大幅减少内存使用之外,分段还具有允许有效使用多处理的其他优点。然而,使用 Take()、TakeWhile()、Where()、Count() 等方法不会为大范围的素数提供非常好的性能,因为它们的使用涉及在许多时钟周期对每个元素进行许多堆栈方法调用每次调用和返回。但是您可以选择使用这些或更命令式的程序流形式,因此这并不是真正的反对意见。
This is a more explicit answer to the question(s) raised, as follows:
Of course the Sieve of Atkin (SoA) can be implemented as a segmented algorithm and in fact does not require the use of segmenting arrays as suggested at all but rather can be done just using raw sequences as I have done functionally using F#, although this is quite a bit slower than the extra efficiency permitted by using mutable arrays for the culling.
The above algorithm can be implemented in at least two different ways: 1) save the state of 'x' and 'y' for each value of 'x' when the sequences "run off" the current segment and start up again with those values for the next segment, or 2) calculate the applicable pair values of 'x' and 'y' to use for the new segment. Although the first way is simpler, I recommend the second method for two reasons: 1) it doesn't use memory for all the (many) values of x and y that must be saved and only a representation of the base primes must be saved in memory for the "squares free" culling step, and 2) it opens the door to using multi-threading and assigning independent thread operations for each page segment for large savings in time on a multi-processor computer.
And indeed, a better understanding of 'x' and 'y' is necessary:
There has been one answer addressing this but perhaps it isn't explicit enough. It may be easier to think of these quadratic equations as a potentially infinite sequence of sequences where one of 'x' or 'y' is fixed starting from their lowest values and the other variable produces all of the elements per sequence. For instance, one could consider the odd expression of the quadratic equation "4*x^2 + y^2" as the sequence of sequences starting with 5, 17, 37, 65, ..., and with each of these sequences have elements as in {5, 13, 29, 53, ...}, {17, 25, 41, 65, ...}, {37, 45, 61, 85, ...}, {65, 73, 89, 125, ...}, ... Obviously, some of these elements are not prime as they are composites of 3 or 5 and that is why these need to be eliminated by either the modulo test, or alternatively as in the Bernstein implementation, these can be skipped automatically by recognizing patterns in the modulo arithmetic for the generators so they never actually even appear.
Implementing the first easier way of producing a segmented version of the SoA then just requires saving the state of each of the sequence of sequences, which is basically what is done in the F# incremental implementation (although using a folded tree structure for efficiency), which could easily be adapted to working over an array page range. For the case where the sequence state is computed at the beginning of every segment page, one just needs to compute how many elements would fit into the space up to the number represented by the lowest element in the new segment page for each "active" sequence, where "active" means a sequence whose starting element is lower than the number represented by the start index of the segment page.
As for pseudo-code on how to implement the segmentation of an array based SoA sieve, I have written something up for a related post, which shows how this can be accomplished.
As stated in another answer, you could accomplish this end goal by just setting the maximum "limit" as a constant in your code, but this would be quite inefficient for small ranges of primes as culling would be taking place over a much larger array than necessary. As already stated, other than for better efficiency and reducing memory use by a huge factor, segmentation also has other advantages in permitting the efficient use of multi-processing. However, use of the Take(), TakeWhile(), Where(), Count(), etcetera, methods won't provide very good performance for large ranges of primes as their use involves many stacked method calls per element at many clock cycles per call and return. But you would have the option of either using these or more imperative forms of program flow, so this isn't a real objection.
我可以尝试解释 x 和 y 的作用,但我认为如果不从头开始重新启动循环,您就无法执行您所要求的操作。这对于任何“筛子”算法来说几乎都是一样的。
筛子的作用基本上是计算有多少个不同的二次方程(偶数或奇数)将每个数字作为解。根据 n % 12 的不同,针对每个数字检查的具体方程也有所不同。
例如,模 12 余数为 1 或 5 的数字 n 是素数,当且仅当解数为 4*x^2+y ^2=n 是奇数且无平方数。第一个循环只是循环遍历可以满足这些不同方程的所有可能的 x 和 y 值。每次我们找到该 n 的解时,通过翻转 isPrime[n],我们可以跟踪解的数量是奇数还是偶数。
问题是我们同时对所有可能的 n 进行计数,这比一次检查一个要高效得多。仅执行一些 n 会花费更多时间(因为您必须确保第一个循环中的 n >= lower_limit)并使第二个循环变得复杂,因为该循环需要了解所有素数小于 sqrt。
第二个循环检查该数字是否无平方(没有因数是素数的平方)。
另外,我认为您对阿特金筛的实施不一定比直接的埃拉托色尼筛更快。然而,最快的、最优化的阿特金筛将击败最快的、最优化的埃拉托色尼筛。
I can try to explain what x and y does, but I don't think you can do what you ask without restarting the loops from the beginning. This is pretty much the same for any "sieve"-algorithm.
What the sieve does is basically count how many different quadratic equations (even or odd) have each number as a solution. The specific equation checked for each number is different depending on what n % 12 is.
For example, numbers n which have a mod 12 remainder of 1 or 5 are prime if and only if the number of solutions to 4*x^2+y^2=n is odd and the number is square-free. The first loop simply loops through all possible values of x and y that could satisfy these different equations. By flipping the isPrime[n] each time we find a solution for that n, we can keep track of whether the number of solutions is odd or even.
The thing is that we count this for all possible n at the same time, which makes this much more efficient than checking one at the time. Doing it for just some n would take more time (because you would have to make sure that n >= lower_limit in the first loop) and complicate the second loop, since that one requires knowledge of all primes smaller than sqrt.
The second loop checks that the number is square-free (has no factor which is a square of a prime number).
Also, I don't think your implementation of the sieve of Atkin is necessarily faster than a straight-forward sieve of Eratosthenes would be. However, the fastest possible most optimized sieve of Atkin would beat the fastest possible most optimized sieve of Eratosthenes.
是的,阿特金筛可以增量。请参阅论文 “两个紧凑增量素筛”,作者:Jonathan Sorenson。
Yes, the sieve of Atkin can be made incremental. See the paper "Two compact incremental prime sieves" by Jonathan Sorenson.