C#: Implementation of the Sieve of Atkin
I was wondering if someone here has a good implementation of the Sieve of Atkin that they would like to share.
I am trying to implement it, but can't quite wrap my head around it. Here is what I have so far.
public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
                for (ulong k = n*n; k <= limit; k *= k)
                    isPrime[k] = false;

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }

    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();

        foreach (var p in primes)
            yield return p;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
I have pretty much just tried to "translate" the pseudocode listed at Wikipedia, but it isn't working correctly. So either I have misunderstood something or just done something wrong. Or most likely both...
I have a list of the first 500 primes which I use as a test, and my implementation fails at number 40 (or 41?).
Values differ at index [40]
Expected: 179
But was: 175
Are you able to find my mistake, do you have an implementation laying around that you could share, or both?
The exact test I am using looks like this:
public abstract class AtkinTests
{
    [Test]
    public void GetEnumerator_FirstFiveHundredNumbers_AreCorrect()
    {
        var sequence = new Atkin(2000000);
        var actual = sequence.Take(500).ToArray();
        var expected = First500;
        CollectionAssert.AreEqual(expected, actual);
    }

    private static readonly ulong[] First500 = new ulong[]
    {
        2, 3, 5, 7, 11, 13, 17, ...
    };
}
6 Answers
This code:

for (ulong k = n*n; k <= limit; k *= k)
    isPrime[k] = false;

doesn't seem to be a faithful translation of this pseudocode:

is_prime(k) ← false, for k = n², 2n², 3n², ... while k ≤ limit

Your code looks like it will run for n * n, n ^ 4, n ^ 8, etc., i.e. squaring each time instead of adding n-squared each time. Try this:

for (ulong k = n*n; k <= limit; k += n*n)
    isPrime[k] = false;
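Putting that fix into the question's code gives a self-contained sketch like the following; the only functional change is stepping the square-free cull by `n*n` instead of squaring `k` (the `3x²−y²` case is also guarded by `x > y` before the subtraction, which avoids relying on ulong wrap-around):

```csharp
using System;
using System.Collections.Generic;

class AtkinFixed
{
    // Sieve of Atkin per the Wikipedia pseudocode, with the corrected
    // square-free step: k advances by n*n (n², 2n², 3n², ...).
    public static List<ulong> Primes(ulong limit)
    {
        var isPrime = new bool[limit + 1];
        var sqrt = (ulong)Math.Sqrt(limit);
        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4 * x * x + y * y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;
                n = 3 * x * x + y * y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;
                if (x > y)
                {
                    n = 3 * x * x - y * y;
                    if (n <= limit && n % 12 == 11)
                        isPrime[n] ^= true;
                }
            }
        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
                for (ulong k = n * n; k <= limit; k += n * n) // was: k *= k
                    isPrime[k] = false;
        var primes = new List<ulong> { 2, 3 };
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
        return primes;
    }

    static void Main()
    {
        // Index 40 is now 179 instead of 175: 175 = 5·5·7 survived the broken
        // cull because only 25, 625, ... were being cleared.
        Console.WriteLine(Primes(200)[40]); // 179
    }
}
```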
The last answer by Aaron Murgatroyd, derived from the translated Python source for a Sieve of Atkin (SoA), isn't too bad, but it can be improved in several respects, as it misses some important optimizations, as follows:
His answer doesn't use the full modulo-60 original Atkin and Bernstein version of the sieve but rather a slightly improved variation of the pseudocode from the Wikipedia article, and so uses about 0.36 times the numerical sieve range in combined toggle/cull operations; my code below uses the reasonably efficient non-page-segmented pseudocode as per my comments in an answer commenting on the Sieve of Atkin, which uses a factor of only about 0.26 times the numerical range, reducing the amount of work done to about two sevenths.
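For reference, the full modulo-60 classification that the original Atkin and Bernstein sieve is built on can be derived mechanically. This sketch (my own illustration, not the code from this answer) splits the 16 residues coprime to 60 among the three quadratics:

```csharp
using System;
using System.Collections.Generic;

class Mod60Residues
{
    static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

    // Returns the residues mod 60 (coprime to 60) handled by each of the
    // three Atkin quadratics: 4x²+y², 3x²+y² and 3x²-y².
    public static (List<int> q1, List<int> q2, List<int> q3) Classify()
    {
        var q1 = new List<int>(); // n ≡ 1 (mod 4): toggled by 4x² + y²
        var q2 = new List<int>(); // n ≡ 7 (mod 12): toggled by 3x² + y²
        var q3 = new List<int>(); // n ≡ 11 (mod 12): toggled by 3x² - y²
        for (int r = 1; r < 60; r++)
        {
            if (Gcd(r, 60) != 1) continue;      // divisible by 2, 3 or 5
            if (r % 4 == 1) q1.Add(r);
            else if (r % 12 == 7) q2.Add(r);
            else if (r % 12 == 11) q3.Add(r);   // covers all remaining residues
        }
        return (q1, q2, q3);
    }

    static void Main()
    {
        var (q1, q2, q3) = Classify();
        Console.WriteLine($"4x^2+y^2: {string.Join(",", q1)}"); // 8 residues
        Console.WriteLine($"3x^2+y^2: {string.Join(",", q2)}"); // 4 residues
        Console.WriteLine($"3x^2-y^2: {string.Join(",", q3)}"); // 4 residues
    }
}
```

The 8 + 4 + 4 = 16 surviving residues out of 60 are where the roughly 0.26-times-the-range operation count comes from.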
His code reduces the buffer size by representing only odd numbers, whereas my code further bit-packs to eliminate any representation of the numbers divisible by three and five as well as those divisible by two implied by "odds-only"; this reduces the memory requirement by a further factor of almost half (to 8/15) and helps make better use of the CPU caches for a further increase in speed due to reduced average memory access time.
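The 8/15 figure can be checked directly: of the 15 odd residues in each block of 30 numbers, only 8 are also coprime to 3 and 5 (again a sketch of my own, not the answer's packing code):

```csharp
using System;
using System.Collections.Generic;

class WheelRatio
{
    // Residues mod 30 that are coprime to 2, 3 and 5; only these need a bit
    // in a 2/3/5 bit-packed sieve buffer.
    public static List<int> WheelResidues()
    {
        var residues = new List<int>();
        for (int r = 1; r < 30; r += 2)         // the 15 odd residues
            if (r % 3 != 0 && r % 5 != 0)
                residues.Add(r);
        return residues;
    }

    static void Main()
    {
        var residues = WheelResidues();
        // 8 of the 15 odd residues survive, so the packed buffer needs
        // 8/15 the bits of an odds-only representation.
        Console.WriteLine($"{residues.Count} of 15");
        Console.WriteLine(string.Join(",", residues)); // 1,7,11,13,17,19,23,29
    }
}
```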
My code counts the number of primes using a fast Look-Up Table (LUT) pop-count technique, which takes almost no time compared to the approximately one second needed by the bit-by-bit technique he uses; in this sample code, however, even that small time is taken out of the timed portion of the code.
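The LUT pop-count idea can be sketched as follows (my illustration, not the answer's code): count set bits one 16-bit chunk at a time through a precomputed table instead of shifting and testing every bit of the sieve buffer:

```csharp
using System;

class PopCountLut
{
    // One-time table: number of set bits in every 16-bit value.
    static readonly byte[] Lut = BuildLut();

    static byte[] BuildLut()
    {
        var lut = new byte[65536];
        for (int i = 1; i < 65536; i++)
            lut[i] = (byte)(lut[i >> 1] + (i & 1)); // popcount recurrence
        return lut;
    }

    // Counts set bits in a packed sieve buffer two table lookups per 32-bit
    // word, instead of 32 shift-and-test steps.
    public static int CountBits(uint[] buffer)
    {
        int count = 0;
        foreach (var w in buffer)
            count += Lut[w & 0xFFFF] + Lut[w >> 16];
        return count;
    }

    static void Main()
    {
        var buffer = new uint[] { 0xFFFFFFFF, 0x0000000F, 0x80000001 };
        Console.WriteLine(CountBits(buffer)); // 32 + 4 + 2 = 38
    }
}
```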
Finally, my code optimizes the bit-manipulation operations for a minimum of code per inner loop. For instance, it does not use a continual right shift by one to produce the odd representation index, and in fact does very little bit shifting at all, by writing all of the inner loops as constant-modulo (equals bit position) operations. As well, Aaron's translated code is quite inefficient in its operations: for instance, in the prime-square-free culling it adds the square of the prime to the index and then checks for an odd result, rather than just adding two times the square so that no check is required; it then makes even that check redundant by shifting the number right by one (dividing by two) before doing the cull operation in the inner loop, just as it does for all the loops. This inefficient code won't make much of a difference in execution time for large ranges using this "large sieve buffer array" technique, as most of the time per operation is spent on RAM memory access (about 37 CPU clock cycles or more for a range of one billion), but will make the execution time much slower than it needs to be for smaller ranges which fit into the CPU caches; in other words, it sets too high a lower bound on the execution time per operation.
The code is as follows:
This code runs about twice as fast as Aaron's code: about 2.7 seconds in 64-bit or 32-bit mode on an i7-2700K (3.5 GHz), with a buffer of about 16.5 Megabytes and about 0.258 billion combined toggle/prime-square-free cull operations (which can be shown by uncommenting the "++this.opcnt" statements) for a sieve range of one billion, as compared to 5.4/6.2 seconds (32-bit/64-bit) for his code without the count time, with almost twice the memory use and about 0.359 billion combined toggle/cull operations for sieving up to one billion.
Although it is faster than his most-optimized naive odds-only implementation of the non-paged Sieve of Eratosthenes (SoE), that does not make the Sieve of Atkin faster than the Sieve of Eratosthenes: if one applies techniques similar to those used in the above SoA implementation to the SoE, plus maximal wheel factorization, the SoE will run at about the same speed as this.
Analysis: Although the number of operations for the fully optimized SoE is about the same as the number of operations for the SoA at a sieve range of one billion, the main bottleneck for these non-paged implementations is memory access once the sieve buffer size exceeds the CPU cache sizes (for my i7: 32 Kilobytes of L1 cache at one clock cycle access, 256 Kilobytes of L2 cache at about four clock cycles access time, and 8 Megabytes of L3 cache at about 20 clock cycles access time), after which memory access can exceed a hundred clock cycles.
Now both gain a factor of about eight in memory access speed when one adapts the algorithms to page segmentation, so one can sieve ranges that would not otherwise fit into available memory. However, the SoE continues to gain over the SoA as the sieve range starts to get very large, due to difficulties in implementing the "primes square free" part of the algorithm: the huge strides in the culling scans quickly grow to many hundreds of times the size of the page buffers. As well, and perhaps more seriously, it becomes very memory- and/or computationally intensive to compute the new start point of 'y' for each value of 'x' at the lowest representation of each page buffer, for a further quite large loss in efficiency of the paged SoA compared to the SoE as the range grows.
EDIT_ADD: The odds-only SoE as used by Aaron Murgatroyd uses about 1.026 billion cull operations for a sieve range of one billion, about four times as many operations as the SoA, and thus should run about four times slower. However, the SoA even as implemented here has a more complex inner loop, and, especially because a much higher proportion of the odds-only SoE culls have a much shorter stride in the culling scans than the strides of the SoA, the naive odds-only SoE has much better average memory access times despite the sieve buffer greatly exceeding the CPU cache sizes (better use of cache associativity). This explains why the above SoA is only about twice as fast as the odds-only SoE, even though it would theoretically seem to be doing only a quarter of the work.
If one were to use a similar algorithm with constant-modulo inner loops as for the above SoA and implement the same 2/3/5 wheel factorization, the SoE would reduce the number of cull operations to about 0.405 billion, only about 50% more operations than the SoA; it would theoretically run just slightly slower than the SoA, but may run at about the same speed because the cull strides are still a little smaller on average than for the SoA in this "naive" large-memory-buffer use. Increasing the wheel factorization to the 2/3/5/7 wheel reduces the SoE cull operations to about 0.314 billion for a cull range of one billion, and may make that version of the SoE run at about the same speed for this algorithm.
Further wheel factorization can be applied by pre-culling the sieve array (copying in a pattern) for the 2/3/5/7/11/13/17/19 prime factors at almost no cost in execution time, which reduces the total number of cull operations to about 0.251 billion for a sieve range of one billion; the SoE will then run faster than, or at about the same speed as, even this optimized version of the SoA, even for these large-memory-buffer versions, while still having much less code complexity than the above.
Thus, it can be seen that the number of operations for the SoE can be greatly reduced from a naive, odds-only, or 2/3/5 wheel factorization version such that the number of operations is about the same as for the SoA, while at the same time the time per operation may actually be less due to both less complex inner loops and more efficient memory access. END_EDIT_ADD
EDIT_ADD2: I here add the code for a SoE using a similar constant modulo/bit position technique for the innermost loops as for the SoA above, according to the pseudocode further down the answer linked above. The code is quite a bit less complex than the above SoA in spite of having high wheel factorization and pre-culling applied, such that the total number of cull operations is actually less than the combined toggle/cull operations for the SoA up to a sieving range of about two billion. The code is as follows:
EDIT_FINAL: Corrected the code below and the comments related to it. END_EDIT_FINAL
This code actually runs a few percent faster than the above SoA, as it should, since there are slightly fewer operations; the main bottleneck for this large array size, for a range of a billion, is memory access time of something like 40 to over 100 CPU clock cycles depending on CPU and memory specifications. This means that code optimizations (other than reducing the total number of operations) are ineffective, as most of the time is spent waiting on memory access. At any rate, using a huge memory buffer isn't the most efficient way to sieve large ranges: using page segmentation with the same maximum wheel factorization gives the SoE an improvement of up to about eight times (and also paves the way for multi-processing).
It is in implementing page segmentation and multi-processing that the SoA is really deficient for ranges much above four billion as compared to the SoE, as any gains due to the reduced asymptotic complexity of the SoA rapidly get eaten up by page-processing overhead factors related to the prime-square-free processing and to calculating the much larger number of page start addresses; alternatively, one overcomes this by storing markers in RAM at a huge cost in memory consumption, with further inefficiencies in accessing these marker store structures. END_EDIT_ADD2
In short, the SoA isn't really a practical sieve as compared to the fully wheel-factorized SoE: just as the gain in asymptotic complexity starts to bring it close in performance to the fully optimized SoE, it starts to lose efficiency due to the details of practical implementation, such as relative memory access time and page segmentation complexities, as well as generally being more complex and difficult to write. In my opinion it is more of an interesting intellectual concept and mental exercise than a practical sieve as compared to the SoE.
Some day I will adapt these techniques to a multi-threaded page-segmented Sieve of Eratosthenes to be about as fast in C# as Atkin and Bernstein's "primegen" implementation of the SoA in 'C', and it will blow that out of the water for large ranges above about four billion even single-threaded, with an extra boost in speed of up to about four times when multi-threading on my i7 (eight cores including Hyper-Threading).
Here's another implementation. It uses BitArray to save memory. The Parallel.For needs .NET Framework 4.
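As a small sketch of the BitArray memory saving (my own illustration, not necessarily this answer's code): a bool[] spends a whole byte per flag, while BitArray packs eight flags per byte; note, though, that BitArray's indexer is a read-modify-write on a 32-bit word, which is one reason a naive Parallel.For over a single shared BitArray is not thread-safe:

```csharp
using System;
using System.Collections;

class BitArrayNote
{
    static void Main()
    {
        const int limit = 2000000;
        // A bool[limit + 1] would use ~2 MB; a BitArray uses ~250 KB.
        var isPrime = new BitArray(limit + 1);

        // The toggle from the question's sieve, isPrime[n] ^= true,
        // becomes an explicit read-modify-write on a BitArray:
        isPrime[25] = !isPrime[25];
        Console.WriteLine(isPrime[25]);              // True

        // Approximate bytes of backing store: one 32-bit word per 32 flags.
        Console.WriteLine((limit + 1 + 31) / 32 * 4); // 250004
    }
}
```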
Here is a faster implementation of the Sieve of Atkin, I stole the algorithm from this Python script here (I take no credit for the algorithm):
http://programmingpraxis.com/2010/02/19/sieve-of-atkin-improved/
It's close in speed to my most optimised version of the Sieve of Eratosthenes, but it's still slower by about 20%; my version can be found here:
https://stackoverflow.com/a/9700790/738380
Here's mine; it uses a class called CompartmentalisedParallel which allows you to perform parallel for loops while controlling the number of threads so that the indexes are grouped up. However, due to the threading issues you need to either lock the BitArray each time it is altered, or create a separate BitArray for each thread and then XOR them together at the end; the first option was pretty slow because of the number of locks, so the second option seemed faster for me!
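The per-thread-BitArray approach works because XOR is order-independent: a bit toggled an even number of times across all threads cancels out, exactly as it would with sequential isPrime[n] ^= true. A minimal sketch of the merge using the built-in BitArray.Xor (my illustration, not the posted classes):

```csharp
using System;
using System.Collections;

class XorMerge
{
    static void Main()
    {
        // Each thread toggles into its own private BitArray...
        var threadA = new BitArray(16);
        threadA[3] = true;  // toggled once by thread A
        threadA[5] = true;  // toggled once by thread A
        var threadB = new BitArray(16);
        threadB[5] = true;  // ...and once more by thread B
        threadB[7] = true;

        // ...then the arrays are XORed together at the end.  Bit 5 was
        // toggled twice in total, so it cancels to false.
        threadA.Xor(threadB); // Xor mutates threadA in place
        Console.WriteLine($"{threadA[3]} {threadA[5]} {threadA[7]}"); // True False True
    }
}
```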
And the CompartmentalisedParallel class:
Primes base class:
Use it by doing the following:
However, I found the Eratosthenes to be quicker in all cases, even with a four-core CPU running the Atkin in multithreaded mode:
If you get the Atkin to run faster, please let me know!
Here's an improvement of the Sieve of Eratosthenes using a custom FixBitArray and unsafe code for speed; this is about 225% faster than my previous Eratosthenes algorithm, and the class is standalone (it is not multithreaded; Eratosthenes is not compatible with multithreading). On an AMD Phenom II X4 965 Processor I can calculate primes to a 1,000,000,000 limit in 9,261 ms:
Primes found in 1,000,000,000: 50,847,534 in 9,261 ms