C#:如何使阿特金筛增量

发布于 2024-08-08 00:13:25 字数 2213 浏览 7 评论 0 原文

我不知道这是否可能,但我只是想问一下。我的数学和算法技能在这里有点让我失败:P

问题是我现在有这个类可以生成达到一定限制的素数:

public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
            {
                var s = n * n;
                for (ulong k = s; k <= limit; k += s)
                    isPrime[k] = false;
            }

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }


    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();

        foreach (var p in primes)
            yield return p;
    }


    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

现在,我想要的是摆脱限制,以便序列永远不会停止(理论上)。

我想事情可能是这样的:

  1. 从一些琐碎的限制开始
    • 查找达到极限的所有素数
    • 产生所有新发现的素数
    • 增加限制(通过将旧限制加倍或平方或类似的方式)
    • 转到第 2 步

最佳情况下,第二步应该只在旧极限之间起作用和新的。换句话说,它不应该一次又一次地找到最低的素数。

有办法做到这一点吗?我的主要问题是我不太明白该算法中的 x 和 y 是什么。就像,我可以使用相同的算法类型,但将 xy 设置为 oldLimit (最初为 1)并运行到 新限制?或者说这会如何运作?有聪明的头脑对此有一些启发吗?

这样做的目的是让我不必设置该限制。这样我就可以使用 Linq 并仅使用 Take() 无论我需要多少素数,而不用担心限制是否足够高等等。

I don't know if this is possible or not, but I just gotta ask. My mathematical and algorithmic skills are kind of failing me here :P

The thing is I now have this class that generates prime numbers up to a certain limit:

public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
            {
                var s = n * n;
                for (ulong k = s; k <= limit; k += s)
                    isPrime[k] = false;
            }

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }


    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();

        foreach (var p in primes)
            yield return p;
    }


    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Now, what I would like is to get rid of the limit so that the sequence would never stop (theoretically).

I am thinking it could go something like this:

  1. Start with some trivial limit
    • Find all the primes up to the limit
    • Yield all the newfound primes
    • Increase the limit (by doubling or squaring the old limit or something like that)
    • Goto step 2

And optimally that step two should only have to work between the old limit and the new one. In other words it shouldn't have to find the lowest primes again and again.

Is there a way this can be done? My main problem is that I don't quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit? Or how would that work? Any bright minds with some light to shed on this?

The point of this is so that I won't have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

栩栩如生 2024-08-15 00:13:25

下面是 C# 中无界素数筛选的解决方案,可以使用埃拉托斯特尼筛法 (SoE) 或阿特金筛法 (SoA) 算法实现;然而,我认为,与真正的 SoE 提供大约相同的性能而没有那么多复杂性相比,优化 SoA 解决方案的极端复杂性几乎不值得。因此,这可能只是部分答案,因为虽然它展示了如何实现更好的 SoA 算法并展示了如何使用 SoE 实现无界序列,但它仅暗示如何组合这些来编写相当高效的 SoA。

请注意,如果只需要讨论这些想法的最快实现,请跳到此答案的底部。

首先,我们应该评论此练习的要点,即生成无界素数序列以允许使用 IEnumerable 扩展方法,例如 Take()、TakeWhile()、Where()、Count() 等,因为这些方法由于增加了方法调用级别而牺牲了一些性能,每次调用至少增加 28 个机器周期并返回枚举一个值并为每个函数添加几级方法调用;也就是说,即使对枚举结果使用更多命令式过滤技术以获得更快的速度,拥有(有效)无限序列仍然有用。

请注意,在下面的更简单的示例中,我将范围限制为无符号 32 位数字 (uint),超过该范围基本 SoE 或 SoA 实现在效率方面并不真正合适,需要进行切换到“桶”或其他形式的增量筛以进行部分筛分以提高效率; 但是,对于完全优化的页分段筛选(如最快的实现),范围会增加到 64 位范围,尽管如所写,人们可能不希望筛选超过大约一百万亿(十的十四次方)因为整个范围的运行时间将需要长达数百年的时间。

由于问题选择 SoA 的原因可能是错误的,首先将 Trial Division (TD) 素数筛误认为是真正的 SoE,然后使用天真的 SoE在 TD 筛上的 SoA,让我们建立真正的有界 SoE,它可以作为一种方法在几行中实现(可以根据问题的实现编码风格转换为类),如下所示:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  var BFLMT = (top_number - 3u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  for (var i = 0u; i <= BFLMT; ++i) if (buf[(int)i]) {
      var p = 3u + i + i; if (i <= SQRTLMT) {
        for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
          buf[(int)j] = false; } yield return p; } }

这将计算 200 万的素数在 i7-2700K (3.5 GHz) 上大约需要 16 毫秒,在同一台机器上大约 67 秒内将 203,280,221 个素数计算到 32 位数字范围内的 4,294,967,291(假设有 256 MB 的备用 RAM)。

现在,使用上面的版本与 SoA 进行比较并不公平,因为真正的 SoA 会自动忽略 2、3 和 5 的倍数,因此引入轮分解对 SoE 执行相同操作会产生以下代码:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  yield return 5u; if (top_number < 7u) yield break;
  var BFLMT = (top_number - 7u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
  for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
    if (buf[(int)i]) { var p = 7u + i + i; if (i <= SQRTLMT) {
        var pX2 = p + p; uint[] pa = { p, pX2, pX2 + p };
        for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                               j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
          buf[(int)j] = false; } yield return p; } }

上面的代码计算在与上述相同的机器上,在大约 10 毫秒内将质数变为 200 万,在大约 40 秒内将质数变为 32 位数字范围。

接下来,让我们确定我们可能在此处编写的 SoA 版本与上述代码中的 SoE 相比,在执行速度方面是否确实具有任何优势。下面的代码遵循上述 SoE 模型,并优化了 维基百科文章 至于按照该文章中的建议,对各个二次解使用单独的循环来调整“x”和“y”变量的范围,仅针对奇数解求解二次方程(以及平方自由消除),结合“3 *x^2" 二次方程可在同一遍中求解正第二项和负第二项,并消除计算量大的模运算,生成的代码比此处发布的伪代码的简单版本快三倍以上,如此处问题中所使用的。有界 SoA 的代码如下:

static IEnumerable<uint> primesSoA(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  var BFLMT = (top_number - 3u) / 2u; var buf = new BitArray((int)BFLMT + 1, false);
  var SQRT = (uint)(Math.Sqrt((double)top_number)); var SQRTLMT = (SQRT - 3u) / 2u;
  for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1) {
    for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1) {
      if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u; }
    mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u; dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u; }
  for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1) {
    int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
    for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1) {
      if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12; }
    if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; } else { n += 1; md12 += 2; dmd12 = 8; }
    if (md12 >= 12) md12 -= 12; for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1) {
      if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12; }
    mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12; dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12; }
  for (var i = 0u; i<=BFLMT; ++i) if (buf[(int)i]) { var p = 3u+i+i; if (i<=SQRTLMT) { var sqr = p*p;
        for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr) buf[(int)j] = false; } yield return p; } }

由于操作数量未完全优化,这仍然比发布的轮分解 SoE 算法慢两倍以上。可以对 SoA 代码进行进一步优化,如对原始(非伪代码)算法使用模 60 运算,并使用位打包仅处理除 3 和 5 之外的倍数,以使代码的性能与 SoE 相当接近甚至稍微超过它,但我们越来越接近 Berstein 实现实现此性能的复杂性。我的观点是“当人们非常努力地工作以获得与 SoE 相同的性能时,为什么选择 SoA?”。

现在,对于无界素数序列,最简单的方法是定义 Uint32.MaxValue 的 const top_number 并消除 primesSoE 或 primesSoA 方法中的参数。这有点低效,因为无论正在处理的实际素数值如何,剔除仍然在整个数字范围内完成,这使得确定小范围素数花费的时间比应有的时间长得多。除了这种情况和极端的内存使用之外,还有其他原因需要使用素数筛的分段版本:首先,使用主要位于 CPU L1 或 L2 数据缓存中的数据的算法由于内存访问效率更高而处理速度更快,并且其次,因为分段允许将作业轻松分割为多个页面,这些页面可以使用多核处理器在后台进行剔除,从而实现与所使用的核心数量成正比的加速。

为了提高效率,我们应该选择 CPU L1 或 L2 缓存大小的数组大小,对于大多数现代 CPU 而言,该数组大小至少为 16 KB,以避免在我们从数组中剔除素数组合时发生缓存抖动,这意味着 BitArray 可以具有大小是其大小的八倍(每字节 8 位)或 128 KB。由于我们只需要筛选奇数素数,这代表了超过 25 万的数字范围,这意味着分段版本将仅使用 8 个分段来筛选 200 万个,如欧拉问题 10 的要求。人们可以将结果保存在最后一段并从该点继续,但这将妨碍将此代码适应多核情况,在多核情况下,将剔除到多线程的后台以充分利用多核处理器。 (单线程)无界 SoE 的 C# 代码如下:

static IEnumerable<uint> primesSoE() { yield return 2u; yield return 3u; yield return 5u;
  const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits...
  const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations
  var buf = new BitArray((int)BUFSZ);
  const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range
  var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u;
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position
                    0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow
  byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow...
                      15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 };
  uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true);
  for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array
      for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3];
  for (uint i = 0u, w = 0, si = 0; i <= MAXNDX;
        i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) {
    if (si == 0) { buf.SetAll(true);
      for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u)
        if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j;
          if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) {
            k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p);
              sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i;
          for (; k<BUFSZ; k+=pa[WHLPTRN[sw]-1], sw=(sw<7u) ? ++sw : 0u) buf[(int)k]=false; } }
    if (buf[(int)si]) yield return 7u + i + i; } }

上面的代码大约需要 16 毫秒来筛选最多 200 万素数,大约需要 30 秒来筛选完整的 32 位数字范围。此代码比使用阶乘轮而不对大数字范围进行分段的类似版本更快,因为即使代码在计算方面更复杂,但由于不破坏 CPU 缓存,因此节省了大量时间。大部分复杂性在于计算每个段每个基素数的新起始索引,这可以通过保存每个段每个素数的状态来减少,但是上面的代码可以很容易地转换,以便在分离后台线程,使四核机器的速度进一步提高四倍,八核机器的速度甚至更快。不使用 BitArray 类并通过位掩码寻址各个位位置将提供大约两倍的进一步加速,并且阶乘轮的进一步增加将提供约三分之二的时间的另一次减少。对于通过轮分解消除的值,更好地将位数组打包在消除索引中会稍微提高大范围的效率,但也会在一定程度上增加位操作的复杂性。然而,所有这些 SoE 优化都不会达到 Berstein SoA 实现的极端复杂性,但运行速度大致相同。

要将上述代码从 SoE 转换为 SoA,我们只需要从有界情况更改为 SoA 剔除代码,但修改为每个页段重新计算起始地址,就像为 SoE 情况计算起始地址一样,但运算起来更加复杂,因为二次方程是通过数字的平方而不是简单的素数倍数来推进的。我将把所需的调整留给学生,尽管我并没有真正明白这一点,因为经过合理优化的 SoA 并不比当前版本的 SoE 更快,而且要复杂得多;)

EDIT_ADD:< /strong>

注意:下面的代码已被更正,虽然原始发布的代码对于提供的素数序列是正确的,但它比需要的速度慢了一半,因为它剔除了平方根以下的所有数字 最有效和最简单的优化是将每个段页面的剔除操作委托给后台线程,以便在有足够的 CPU 核心的

情况下,主要枚举素数的限制是枚举循环本身,在这种情况下,在上述参考机(C# 中)上,可以在大约十秒内枚举出 32 位数字范围内的所有素数,而无需其他优化,所有其他优化包括编写IEnumerable 接口实现而不是使用内置迭代器,将 32 位数字范围内的所有 203,280,221 个素数的时间减少到大约六秒(1.65 秒到十亿),同样对于大多数仅仅枚举结果所花费的时间。以下代码进行了许多修改,包括使用任务使用的 Dotnet Framework 4 线程池中的后台任务、使用作为查找表实现的状态机来实现进一步的位打包以使素数枚举更快,以及重写算法作为使用“自己滚动”迭代器来提高效率的可枚举类:

class fastprimesSoE : IEnumerable<uint>, IEnumerable {
  struct procspc { public Task tsk; public uint[] buf; }
  struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; }
  static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u;
  const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes...
  const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk
  const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks)
  static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position
  static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow...
                                      17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up
  const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u;
  static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64];
  static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt);
    for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT;
      j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? bw >>= 4 : 0x31321212u) {
      var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break;
      if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) {
            var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } }
        else k -= i; var pd = p / 15;
        for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u];
               l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw];
          b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } }
      if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } }
  static fastprimesSoE() {
    for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15;
      for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15;
        var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m,
                                                                     xtr = (byte)(i / 15),
                                                                     nxt = (byte)(8 * x + WHLNDX[i % 15]) }; }
    } cullpg(0u, bpbuf, 0, bpbuf.Length);  } //init baseprimes
  class nmrtr : IEnumerator<uint>, IEnumerator, IDisposable {
    procspc[] ps = new procspc[NUMPROCS]; uint[] buf;
    Task dlycullpg(uint i, uint[] buf) {  return Task.Factory.StartNew(() => {
        for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); }
    public nmrtr() {
      for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] };
      for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf;  }
    public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
    uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0;
    public bool MoveNext() {
      if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; }
        else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; }
      do {  i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) {
          if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = ps[prc + 1];
            ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS;
            ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf);
            ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
      while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true;  }
    public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
    public void Dispose() { }
  }
  public IEnumerator<uint> GetEnumerator() { return new nmrtr(); }
  IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } }

请注意,对于小范围的素数(最多一两百万),此代码并不比上一个版本更快,因为设置和初始化数组,但对于高达 40 亿以上的更大范围,速度要快得多。它比阿特金筛问题的实施快大约 8 倍,但对于高达 40 亿以上的更大范围,速度会快 20 到 50 倍。代码中定义了常量,用于设置基本缓存大小以及每个线程要剔除的缓存数量 (CHNKSZ),这可能会稍微调整性能。可以尝试一些进一步的优化,这可能会将大素数的执行时间减少多达两倍,例如针对 2,3,5,7 轮进行进一步的位打包,而不仅仅是 2,3,5 轮打包工作减少了约 25%(也许将执行时间减少到三分之二),并且通过轮阶乘预先剔除页面段高达 17 倍,进一步减少了约 20%,但这些只是程度与可以利用独特的本机代码优化的 C 代码相比,纯 C# 代码可以完成哪些工作。

无论如何,如果您打算按照问题的要求使用 IEnumberable 接口进行输出,则几乎不值得进一步优化,因为大约三分之二的时间用于枚举找到的素数,而只有大约三分之一的时间用于剔除合数。更好的方法是在类上编写方法来实现所需的结果,例如 CountTo、ElementAt、SumTo 等,从而完全避免枚举。

但我确实做了额外的优化作为附加答案,尽管复杂性增加,但它显示了额外的好处,并且进一步说明了为什么人们不想使用 SoA,因为完全优化的 SoE 实际上更好。

Here is a solution to unbounded prime sieving in C#, which can be implemented using the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithms; however, I maintain that it is hardly worth the extreme complexity of an optimized SoA solution given than the true SoE gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these to write a reasonably efficient SoA.

Note that if only discussion on the very fastest implementations of these ideas is desired, jump to the bottom of this answer.

First we should comment on the point of this exercise in producing an unbounded sequence of primes to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), etcetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) infinite sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.

Note, in the the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren't really appropriate as to efficiency and one needs to switch over to a "bucket" or other form of incremental sieve for part of the sieving for efficiency; however, for a fully optimized page segmented sieve as in the fastest implementation, the range is increased to the 64-bit range although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time would take up to hundreds of years for the full range.

As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) prime sieve for a true SoE and then using a naive SoA over the TD sieve, let's establish the true bounded SoE which can be implemented in a few lines as a method (which could be converted to a class as per the question's implementation coding style) as follows:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  var BFLMT = (top_number - 3u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  for (var i = 0u; i <= BFLMT; ++i) if (buf[(int)i]) {
      var p = 3u + i + i; if (i <= SQRTLMT) {
        for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
          buf[(int)j] = false; } yield return p; } }

This calculates the primes to 2 million in about 16 milliseconds on an i7-2700K (3.5 GHz) and the 203,280,221 primes up to 4,294,967,291 in the 32-bit number range in about 67 seconds on the same machine (given a spare 256 MegaBytes of RAM memory).

Now, using the version above to compare to the SoA is hardly fair as the true SoA automatically ignores the multiples of 2, 3, and 5 so introducing wheel factorization to do the same for the SoE yields the following code:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  yield return 5u; if (top_number < 7u) yield break;
  var BFLMT = (top_number - 7u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
  for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
    if (buf[(int)i]) { var p = 7u + i + i; if (i <= SQRTLMT) {
        var pX2 = p + p; uint[] pa = { p, pX2, pX2 + p };
        for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                               j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
          buf[(int)j] = false; } yield return p; } }

The above code calculates the primes to 2 million in about 10 milliseconds and the primes to the 32-bit number range in about 40 seconds on the same machine as above.

Next, let's establish whether a version of the SoA that we are likely to write here actually has any benefit as compared to the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article as to tuning the ranges of the 'x' and 'y' variables using separate loops for the individual quadratic solutions as suggested in that article, solving the quadratic equations (and the square free eliminations) only for odd solutions, combining the "3*x^2" quadratic to solve for both the positive and negative second terms in the same pass, and eliminating the computationally expensive modulo operations, to produce code that is over three times faster than the naive version of the pseudo-code posted there and as used in the question here. The code for the bounded SoA is as then follows:

static IEnumerable<uint> primesSoA(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  var BFLMT = (top_number - 3u) / 2u; var buf = new BitArray((int)BFLMT + 1, false);
  var SQRT = (uint)(Math.Sqrt((double)top_number)); var SQRTLMT = (SQRT - 3u) / 2u;
  for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1) {
    for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1) {
      if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u; }
    mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u; dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u; }
  for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1) {
    int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
    for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1) {
      if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12; }
    if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; } else { n += 1; md12 += 2; dmd12 = 8; }
    if (md12 >= 12) md12 -= 12; for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1) {
      if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12; }
    mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12; dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12; }
  for (var i = 0u; i<=BFLMT; ++i) if (buf[(int)i]) { var p = 3u+i+i; if (i<=SQRTLMT) { var sqr = p*p;
        for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr) buf[(int)j] = false; } yield return p; } }

This is still over twice as slow as the wheel factorization SoE algorithm posted due to the not fully optimized number of operations. Further optimizations can be made to the SoA code as in using modulo 60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5 to get the code fairly close in performance to SoE or even exceed it slightly, but we get closer and closer to the complexity of the Berstein implementation to achieve this performance. My point is "Why SoA, when one works very hard to get about the same performance as SoE?".

Now for the unbounded primes sequence, the very easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument in the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination for small ranges of primes take much longer than it should. There are also other reasons to go to a segmented version of the primes sieve other than this and extreme memory use: First, algorithms that use data that is primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and secondly because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors for a speed-up that can be proportional to the number of cores used.

For efficiency, we should choose an array size of the CPU L1 or L2 cache size which is at least 16 Kilobytes for most modern CPU's in order to avoid cache thrashing as we cull composites of primes from the array, meaning that the BitArray can have a size of eight times that large (8 bits per byte) or 128 Kilobits. Since we only need to sieve odd primes, this represents a range of numbers of over a quarter million, meaning that a segmented version will only use eight segments to sieve to 2 million as required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case where one relegates culling to the background on multiple threads to take full advantage of multi-core processors. The C# code for an (single thread) unbounded SoE is as follows:

static IEnumerable<uint> primesSoE() { yield return 2u; yield return 3u; yield return 5u;
  const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits...
  const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations
  var buf = new BitArray((int)BUFSZ);
  const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range
  var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u;
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position
                    0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow
  byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow...
                      15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 };
  uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true);
  for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array
      for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3];
  for (uint i = 0u, w = 0, si = 0; i <= MAXNDX;
        i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) {
    if (si == 0) { buf.SetAll(true);
      for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u)
        if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j;
          if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) {
            k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p);
              sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i;
          for (; k<BUFSZ; k+=pa[WHLPTRN[sw]-1], sw=(sw<7u) ? ++sw : 0u) buf[(int)k]=false; } }
    if (buf[(int)si]) yield return 7u + i + i; } }

The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds to sieve the full 32-bit number range. This code is faster than the similar version using the factorial wheel without segmentation for large number ranges because, even though the code is more complex as to computations, a great deal of time is saved in not thrashing the CPU caches. Much of the complexity is in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per prime per segment but the above code can easily be converted so as to run the culling processes on separate background threads for a further four times speed up for a machine with four cores, even more with eight cores. Not using the BitArray class and addressing the individual bit locations by bit masks would provide a further speed up by about a factor of two, and further increases of the factorial wheel would provide about another reduction in time to about two thirds. Better packing of the bit array in eliminated indexes for values eliminated by the wheel factorization would slightly increase efficiency for large ranges, but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would not approach the extreme complexity of the Berstein SoA implementation, yet would run at about the same speed.

To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case but with the modification that the starting addresses are recalculated for every page segment something like as the starting address is calculated for the SoE case, but even somewhat more complex in operation as the quadratics advance by squares of numbers rather than by simple multiples of primes. I'll leave the required adaptations to the student, although I don't really see the point given that the SoA with reasonable optimizations is not faster than the current version of the SoE and is considerably more complex ;)

EDIT_ADD:

Note: The below code has been corrected, as while the original posted code was correct as to the provided primes sequence, it was half again slower than it needed to be as it culled for all numbers below the square root of the range rather than only the found base primes up to the square root of the range.

The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself, in which case all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations, with all other optimizations including writing the IEnumerable interface implementations rather than using the built-in iterators reducing this to about six seconds for all the 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), again with most of the time spent just enumerating the results. The following code has many of those modifications including using background tasks from the Dotnet Framework 4 ThreadPool used by Tasks, using a state machine implemented as a look up table to implement further bit packing to make the primes enumeration quicker, and re-writing the algorithm as an enumerable class using "roll-your-own" iterators for increased efficiency:

class fastprimesSoE : IEnumerable<uint>, IEnumerable {
  struct procspc { public Task tsk; public uint[] buf; }
  struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; }
  static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u;
  const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes...
  const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk
  const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks)
  static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position
  static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow...
                                      17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up
  const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u;
  static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64];
  static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt);
    for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT;
      j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? bw >>= 4 : 0x31321212u) {
      var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break;
      if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) {
            var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } }
        else k -= i; var pd = p / 15;
        for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u];
               l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw];
          b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } }
      if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } }
  static fastprimesSoE() {
    for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15;
      for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15;
        var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m,
                                                                     xtr = (byte)(i / 15),
                                                                     nxt = (byte)(8 * x + WHLNDX[i % 15]) }; }
    } cullpg(0u, bpbuf, 0, bpbuf.Length);  } //init baseprimes
  class nmrtr : IEnumerator<uint>, IEnumerator, IDisposable {
    procspc[] ps = new procspc[NUMPROCS]; uint[] buf;
    Task dlycullpg(uint i, uint[] buf) {  return Task.Factory.StartNew(() => {
        for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); }
    public nmrtr() {
      for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] };
      for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf;  }
    public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
    uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0;
    public bool MoveNext() {
      if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; }
        else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; }
      do {  i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) {
          if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = ps[prc + 1];
            ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS;
            ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf);
            ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
      while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true;  }
    public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
    public void Dispose() { }
  }
  public IEnumerator<uint> GetEnumerator() { return new nmrtr(); }
  IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } }

Note that this code isn't faster than the last version for small ranges of primes as in up to one or two million due to the overhead of setting up and initializing arrays but is considerably faster for larger ranges up to four billion plus. It is about 8 times faster than the question's implementation of the Atkin sieve, but goes to from 20 to 50 times as fast for larger ranges up to four billion plus. There are defined constants in the code setting the base cache size and how many of these to cull per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two such as further bit packing as for the 2,3,5,7 wheel rather than just the 2,3,5 wheel for a reduction of about 25% in packing (perhaps cutting the execution time to two thirds) and pre-culling the page segments by the wheel factorial up to the factor of 17 for a further reduction of about 20%, but these are about the extent of what can be done in pure C# code as compared to C code which can take advantage of unique native code optimizations.

At any rate, it is hardly worth further optimizations if one intends to use the IEnumberable interface for output as the question desires as about two thirds of the time is used just enumerating the found primes and only about one third of the time is spent in culling the composite numbers. A better approach would be to write methods on the class to implement the desired results such as CountTo, ElementAt, SumTo, etcetera so as to avoid enumeration entirely.

But I did do the extra optimizations as an additional answer which shows additional benefits in spite of additional complexity, and which further shows why one doesn't want to use the SoA because a fully optimized SoE is actually better.

水染的天色ゝ 2024-08-15 00:13:25

以下代码执行我之前的答案底部讨论的优化,并包含以下功能:

  1. 可用范围已增加到 64 但无符号数
    范围 18,446,744,073,709,551,615 并进行范围溢出检查
    删除,因为人们不太可能想要运行该程序
    处理所有种类的物质需要数百年的时间
    数量达到该限制。处理成本非常低
    时间,因为分页可以使用 32 位页范围完成,并且仅
    最终质数输出需要计算为 64 位数字。
  2. 它将车轮分解从 2、3、5 车轮增加到使用
    2、3、5、7 素因子轮,带有额外的复合材料预剔除
    使用附加素数 11、13 和 17 的数字,
    大大减少了冗余合数剔除(现在只
    每个合数平均剔除约1.5次)。到期的
    这样做的(DotNet 相关的)计算开销(也
    适用于2、3、5轮如之前版本)实际时间
    节省剔除并不是那么好,但枚举答案
    由于许多“简单”合数,速度稍快一些
    在打包位表示中被跳过。
  3. 它仍然使用 DotNet 4 及更高版本的任务并行库 (TPL)
    用于基于每页的线程池中的多线程。
  4. 它现在使用自动支持的基本素数表示
    随着更多基本素数的增加,该类包含的数组不断增长
    需要作为线程安全方法而不是固定的预先计算
    先前使用的基本素数数组。
  5. 基本素数表示已减少到每个基本一个字节
    主要用于进一步减少内存占用;因此,总共
    除了代码之外的内存占用是保存该基数的数组
    素数的表示,直到平方根
    当前正在处理的范围,以及打包的位页缓冲区
    当前设置为低于 256 KB 的二级缓存大小
    (最小页面大小 14,586 字节乘以 CHNKSZ 17 作为
    提供)每个 CPU 核心加上一个额外的前台缓冲区
    要处理的任务。使用此代码,大约三兆字节
    足以处理十到十四的素数范围
    力量。以及由于允许高效的多处理而带来的速度,
    这减少了内存需求是使用
    分页筛实现。

    class UltimatePrimesSoE : IEnumerable; {
      静态只读 uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1;常量 CHNKSZ = 17;
      const int L1CACHEPOW = 14,L1CACHESZ = (1 << L1CACHEPOW),MXPGSZ = L1CACHESZ / 2; //对于缓冲区ushort[]
      //2,3,57阶乘轮增量模式,(总和)48个元素长,从素数19位置开始
      静态只读字节[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1, 2,4,3,
                                                                               2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 };常量 FSTCP = 11;
      静态只读字节[] WHLPOS;静态只读字节[] WHLNDX; //从位置索引查找车轮索引
      静态只读字节[] WHLRNDUP; //查找轮子四舍五入的索引位置值,允许溢出
      静态只读 uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n);
      静态只读 uint WHTS = (uint)WHLPTRN.Length;静态只读 uint WPC = WHTS >> 4;
      静态只读字节[] BWHLPRMS = { 2, 3, 5, 7, 11, 13, 17 };常量 uint FSTBP = 19;
      静态只读 uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
      静态只读 uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS;静态只读 uint PGRNG = PGSZ * 16 / WHTS * WCRC;
      静态只读 uint BFSZ = CHNKSZ * PGSZ,BFRNG = CHNKSZ * PGRNG; //块中的uints数量偶数缓存
      静态只读 ushort[] MCPY; //用于保存页面的较低基数预剔除版本的主副本页面
      struct Wst { 公共 ushort msk;公共字节 MLT;公共字节xtr;公共 ushort nxt; }
      静态只读字节[] PRLUT; /*轮索引查找表*/ static readonly Wst[] WSLUT; //轮子状态查找表
      静态只读字节[] CLUT; // 用于快速计数素数的计数查找表
      static int count(uint bitlim, ushort[] buf) { //非常快的计数
        if (bitlim < BFRNG) { var addr = (bitlim - 1) / WCRC; var 位 = WHLNDX[bitlim - 地址 * WCRC] - 1;地址 *= WPC;
          for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << 位)>> (i << 4)); }
        变量 ACC = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
          acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]];返回帐户; }
      静态无效剔除(ulong lwi,ushort [] b){ ulong nlwi = lwi;
        for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //复制预先剔除的较低基数素数。
        for (uint i = 0, pd = 0; ; ++i) { pd += (uint)baseprms[i] >>> 6;
          var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PLRUT[wi];
          var pp = (p - FSTBP)>> 1; var k = (ulong)p * (pp + ((FSTBP - 1) >> 1)) + pp;
          if (k >= nlwi) 中断; if (k < lwi) { k = (lwi - k) % (WCRC * p);
            if (k != 0) { var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
              如果(nwp>=WCRC)wp=0;否则 wp = nwp; } }
          否则 k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
            var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      静态任务 cullbf(ulong lwi, ushort[] b, Action f) {
        return Task.Factory.StartNew(() => { cull(lwi, b); f(b); }); }
      class Bpa { //非常高效的自动调整大小的线程安全只读索引器类来保存基本素数数组
        byte[] sa = 新字节[0]; uint lwi = 0,lpd = 0;对象 lck = 新对象();
        public uint this[uint i] { get { if (i >= this.sa.Length) lock (this.lck) {
                var lngth = this.sa.Length; while (i >= lngth) {
                  var bf = (ushort[])MCPY.Clone();如果(长度== 0){
                    for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
                        bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0:wi){
                      if (msk >= 0x8000) { msk = 1; v = bf[w++]; } 否则 msk <<= 1;
                      if ((v & msk) == 0) { var p = FSTBP + (bi + bi); var k = (p * p - FSTBP)>> 1;
                        if (k >= PGRNG) 中断; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
                        for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
                          var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } }
                  否则 { this.lwi += PGRNG;剔除(this.lwi, bf); }
                  var c = 计数(PGRNG, bf); var na = 新字节[lngth + c]; sa.CopyTo(na, 0);
                  for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
                      长度< na.长度; p += (uint)(WLPPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0:wi){
                    if (msk >= 0x8000) { msk = 1; v = bf[w++]; } 否则 msk <<= 1; if ((v & msk) == 0) {
                      var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd; }
                  } this.sa = na;返回 this.sa[i]; } } }
      静态只读 Bpa baseprms = new Bpa();
      静态 UltimatePrimesSoE() {
        WHLPOS = 新字节[WHLPTRN.Length + 1]; //从车轮索引查找车轮位置索引
        for (字节 i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; }
        WHLNDX = 新字节[WCRC + 1]; for (字节 i = 1; i < WHLPOS.Length; ++i) {
          for (字节 j = (字节)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i; }
        WHLRNDUP = 新字节[WCRC * 2]; for (字节 i = 1; i < WHLRNDUP.Length; ++i) {
          如果 (i > WCRC) WHLRNDUP[i] = (字节)(WCRC + WHLPOS[WHLNDX[i - WCRC]]);否则 WHLRNDUP[i] = WHLPOS[WHLNDX[i]]; }
        Func nmbts = (v) => { var ACC = 0; while (v != 0) { acc += (int)v & 1; v>>=1;返回 acc; };
        CLUT = 新字节[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1));
        PRLUT = 新字节[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) {
          var t = (uint)(WHLPOS[i] * 2) + FSTBP;如果 (t >= WCRC) t -= WCRC;如果(t>=WCRC)t-=WCRC; PRLUT[i] = (字节)t; }
        WSLUT = 新 Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) {
          var p = FSTBP + 2u * WHLPOS[x]; var pr = p % WCRC;
          for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) {
            var m = WHLPTRN[(x + y) % WHTS];
            位置%=WCRC; var posn = WHLNDX[位置]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC];
            WSLUT[x * WHTS + posn] = 新 Wst { msk = (ushort)(1 << (int)(posn & 0xF)), mlt = (byte)(m * WPC),
                                               xtr = (字节)(WPC * nposd + (nposn >> 4) - (posn >> 4)),
                                               nxt = (ushort)(WHTS * x + nposn) }; } }
        MCPY = 新的 ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) { var p = (uint)lp;
          var k = (p * p - FSTBP)>> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) {
            var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      struct PrcsSpc { 公共任务 tsk;公共 ushort[] buf; }
      nmrtr 类:IEnumerator、IEnumerator、IDisposable {
        PrcsSpc[] ps = 新 PrcsSpc[NUMPRCSPCS]; ushort[] buf;
        public nmrtr() { for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] };
          for (var s = 1u; s < NUMPRCSPCS; ++s) {
            ps[s].tsk = cullbf((s - 1u) * BFRNG, ps[s].buf, (bfr) => { }); } buf = ps[0].buf; }
        ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v,msk = 0;
        公共ulong当前{获取{返回this._curr; } } 对象 IEnumerator.Current { get { return this._curr; } }
        公共布尔 MoveNext() {
          if (b < 0) { if (b == -1) b += buf.Length; //没有产量!!!所以自动又出现了
            else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)];返回真; } }
          做 {
            i += WHLPTRN[wi++];如果 (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) {
              if (++b >= BFSZ) { b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1];
                ps[NUMPRCSPCS - 1u].buf = buf;
                ps[NUMPRCSPCS - 1u].tsk = cullbf(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { });
                ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b];这个.msk = 1; } }
          while ((v & msk) != 0u); _curr = FSTBP + i + i;返回真; }
        public void Reset() { throw new Exception("素数枚举重置未实现!!!"); }
        公共无效处理(){}}
      公共 IEnumerator; GetEnumerator() { 返回新 nmrtr(); }
      IEnumerator IEnumerable.GetEnumerator() { 返回新 nmrtr(); }
      static void IterateTo(ulong top_number, Action actn) {
        PrcsSpc[] ps = 新 PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc {
          buf = new ushort[BFSZ], tsk = Task.Factory.StartNew(() => { }) };
        var topndx = (top_number - FSTBP) >>> 1; for (ulong ndx = 0; ndx <= topndx;) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx <;恩xtndx? (uint)(topndx - ndx + 1) : BFRNG;
          ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx, buf, (b) => actn(lowi, lim, b)) };
          ndx = nxtndx; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait(); }
      公共静态长 CountTo(ulong top_number) {
        if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count();
        var cnt = (long)BWHLPRMS.Length;
        IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); });返回cnt; }
      公共静态 ulong SumTo(uint top_number) {
        if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p);
        var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p);
        Func sumbf = (lowi, bitlim, buf) => {
          var ACC = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim;
              i += WHLPTRN[wi++], wi = wi >= WHTS ? 0:wi){
            if (msk >= 0x8000) { msk = 1; v = buf[w++]; } 否则 msk <<= 1;
            if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1));返回 acc; };
        IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); });返回(ulong)总和; }
      static void IterateUntil(Func prdct) {
        PrcsSpc[] ps = 新 PrcsSpc[NUMPRCSPCS];
        for (var s = 0u; s < NUMPRCSPCS; ++s) { var buf = new ushort[BFSZ];
          ps[s] = new PrcsSpc { buf = buf, tsk = cullbf(s * BFRNG, buf, (bfr) => { }) }; }
        for (var ndx = 0UL; ; ndx += BFRNG) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx;如果(prdct(lowi,buf))打破;
          for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          ps[NUMPRCSPCS - 1] = 新 PrcsSpc { buf = buf,
                                             tsk = cullbf(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { }) }; } }
      公共静态 ulong ElementAt(long n) {
        if (n < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)n);
        长cnt = BWHLPRMS.长度; var ndx = 0UL; var 循环 = 0u;变量位 = 0u; IterateUntil((lwi, bfr) => {
          var c = 计数(BFRNG, bfr); if ((cnt += c) < n) 返回 false; ndx = lwi; cnt -= c; c = 0;
          做 { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + CLUT[bfr[w]]; cnt += c; while (cnt < n);
          cnt -= c; var y = (--cycle) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y];
          do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; while (cnt <= n); - 少量;返回真;
        });返回 FSTBP + ((ndx + cycl * WCRC + WHLPOS[位]) << 1); } }
    

上面的代码大约需要 59 毫秒来找到 200 万的素数(由于初始化开销,比其他一些更简单的代码稍慢),但计算出 10 亿的素数和完整的数字范围分别需要 1.55 和 5.95 秒。这并不比上一个版本快多少,因为与剔除合数所花费的时间相比,DotNet 在枚举找到的素数时进行额外的数组边界检查的额外开销还不到枚举所用时间的三分之一,因此剔除复合材料中的节省被枚举中的额外时间(由于每个素数候选者的额外数组边界检查)所抵消。然而,对于许多涉及素数的任务,不需要枚举所有素数,只需计算答案即可,无需枚举。

由于上述原因,该类提供了示例静态方法“CountTo”、“SumTo”和“ElementAt”,分别对素数进行计数或求和至给定上限或输出从零开始的第 n 个素数。 “CountTo”方法将分别在0.32秒和1.29秒内生成到10亿且32位数字范围内的素数个数 >; "ElementAt" 方法将分别在大约 0.32 和 1.25 秒内生成这些范围内的最后一个元素,"SumTo" 方法生成这些范围内所有素数的总和分别约为0.49 和 1.98 秒。该程序将所有素数的总和计算为 40 亿以上,所用的时间比许多简单的实现将所有素数求和为 200 万的时间要短,如欧拉问题 10,超过实际范围的 2000 倍!

此代码仅比 primesieve 使用的高度优化的 C 代码慢大约四倍,并且它较慢的原因主要是由于DotNet,如下(讨论256 KB缓冲区的情况,这是L2缓存的大小):

  1. 大部分执行时间都花在了主要的复合剔除上
    循环,这是私有静态“剔除”中的最后一个“for 循环”
    方法”,每个循环仅包含四个语句加上范围
    查看。
  2. 在 DotNet 中,每个编译过程大约需要 21.83 个 CPU 时钟周期
    循环,包括两个数组边界的大约 5 个时钟周期
    每个循环的检查。
  3. 非常高效的 C 编译器将此循环转换为大约
    8.33 个时钟周期的优势约为 2.67 倍。
  4. Primesieve 还使用极端的手动“循环展开”优化来
    将每个循环执行工作的平均时间减少到大约 4.17
    每个复合剔除循环的时钟周期,额外增益为 2
    倍,总增益约为5.3倍。
  5. 现在,高度优化的 C 代码既不像超线程 (HT)
    以及效率较低的即时 (JIT) 编译器生成的代码
    而且 primesieve 使用的 OpemMP 多线程也没有
    似乎与使用 Thread 一样能很好地解决这个问题
    这里有线程池,所以最终的多线程增益大约只有
    四次。
  6. 人们可能会考虑使用“不安全”指针来消除
    数组边界检查,并且已尝试过,但 JIT 编译器没有
    优化指针以及普通的基于数组的代码,因此增益
    没有数组绑定检查会被效率较低的方法取消
    代码(每次指针访问(重新)从内存加载指针地址
    而不是使用已经指向该地址的寄存器,如
    优化的数组情况)。
  7. 当使用较小的缓冲区大小时,Primesieve 甚至更快
    可用 L1 缓存的大小(多线程时为 16 KB)
    对于 i3/i5/i7 CPU),因为其更高效的代码具有更多
    优点是将平均内存访问时间减少到一个时钟周期
    从大约四个时钟周期开始,这个优势使得
    与 DotNet 代码的区别,后者事半功倍
    每减少页数的处理。因此,primsieve 大约是
    当每个人都使用最有效的缓冲区大小时,速度提高五倍。

此 DotNet 代码将在大约一个半小时内(经过测试)计算 (CountTo) 10 的 13 次方(十万亿)的素数数量,并在超过primesieve 需要半天(估计),而 primesieve 则分别需要 20 分钟和不到 4 小时。这在历史上很有趣,因为直到 1985 年,我们只知道 10 到 13 的范围内的素数计数,因为在当时的(昂贵的)超级计算机上需要太多的执行时间才能找到十倍大的范围;现在我们可以在普通台式计算机(本例中为 Intel i7-2700K - 3.5 GHz)上轻松计算这些范围内的素数个数!

为什么 Atkin 和 Bernstein 教授认为 SoA 比 SoE 更快 - 这个神话一直持续到今天,推理如下:

  1. 使用此代码,很容易理解 任何 SoA 实现都会计算状态数
    切换和无平方复合数剔除(最后可以是
    使用与固有使用相同的 2、3、5 轮优化进行优化
    由SoA算法)来确定两者的总数
    对于 32 位数字范围,这些操作大约为 14 亿次。
  2. Bernstein 的 SoA 的“等效”SoE 实现
    实现(两者都没有像此代码那样优化),
    它使用与 SoA 相同的 2、3、5 轮优化,将具有
    相同剔除循环总共约 18.2 亿次剔除操作
    计算复杂度。
  3. 因此,Bernstein 的结果表现优于 30% 左右
    他对 SoE 的实施相比,大约是正确的,只是
    基于等效操作的数量。然而,他的
    SoE 的实施并没有采取“车轮分解”
    max”,因为 SoA 对车轮进一步转动的角度没有太大反应
    分解为 2、3、5 轮“烘焙”到基本轮中
    算法。
  4. 这里使用的轮分解优化减少了
    对于 32 位数字,复合剔除操作约为 12 亿次
    范围;因此,该算法使用这个轮数
    因式分解的运行速度比同等版本快约 16.7%
    SoA 的,因为剔除循环可以以相同的方式实现
    每个算法。
  5. 具有这种优化级别的 SoE 比
    等效的 SoA,因为只需要一个状态表查找数组
    剔除基本质数而不是额外的状态外观
    四个二次方程解法案例中每一个的 up 数组
    产生有效的状态切换。
  6. 当使用与此代码等效的实现编写时,代码
    将等效地响应 SoA 的 C 编译器优化
    primesieve 中使用的 SoE。
  7. SoA 实现还将响应极端的手动“循环”
    primesieve 使用的展开”优化与
    用于 SoE 实施。
  8. 因此,如果我完成了实现 SoA 的所有工作
    使用上述 SoE 代码技术的算法,
    当输出
    素数已被枚举,但使用时会慢约 16.7%
    静态直接方法。
  9. 两种算法的内存占用没有什么不同
    要求基素数的表示和相同数量的
    段页缓冲区。

EDIT_ADD:有趣的是,此代码在 x86 32 位模式下的运行速度比 x64 64 位模式下快 30%,这可能是由于避免了将 uint32 数字扩展为 ulong 的轻微额外开销。以上所有时序均针对 64 位模式。 END_EDIT_ADD

总而言之:阿特金分段分页筛实际上比埃拉托斯特尼最大优化分段分页筛慢,而且没有节省内存需求!!!

我再说一遍:<强>“为什么使用阿特金筛?”。

The following code does the optimizations as discussed at the bottom of my previous answer and includes the following features:

  1. The usable range has been increased to the 64-but unsigned number
    range of 18,446,744,073,709,551,615 with the range overflow checks
    removed since it is unlikely that one would want to run the program
    for the hundreds of years it would take to process the full range of
    numbers to that limit. This is at very little cost in processing
    time as the paging can be done using 32-bit page ranges and only the
    final prime output needs to be computed as a 64-bit number.
  2. It has increased the wheel factorization from a 2,3,5 wheel to use a
    2,3,5,7 prime factor wheel with an additional pre-cull of composite
    numbers using the the additional primes of 11, 13, and 17, to
    greatly reduce the redundant composite number culling (now only
    culling each composite number an average of about 1.5 times). Due
    to the (DotNet related) computational overheads of doing this (also
    applies for the 2,3,5 wheel as the previous version) the actual time
    saving in culling isn't all that great but enumerating the answers
    is somewhat faster due to many of the "simple" composite numbers
    being skipped in the packed bit representation.
  3. It still uses the Task Parallel Library (TPL) from DotNet 4 and up
    for multi-threading from the thread pool on a per page basis.
  4. It now uses a base primes representation that supports automatically
    growing the array contained by this class as more base primes are
    required as a thread safe method instead of the fixed pre-computed
    base primes array used previously.
  5. The base primes representation has been reduced to one byte per base
    prime for a further reduction in memory footprint; thus, the total
    memory footprint other than the code is the array to hold this base
    primes representation for the primes up to the square root of the
    current range being processed, and the packed bit page buffers which
    are currently set at under the L2 cache size of 256 Kilobytes
    (smallest page size of 14,586 bytes times the CHNKSZ of 17 as
    supplied) each per CPU core plus one extra buffer for the foreground
    task to process. With this code, about three Megabytes is
    sufficient to process the prime range up to ten to the fourteenth
    power. As well as speed due to allowing efficient multiprocessing,
    this reduce memory requirement is the other advantage of using a
    paged sieve implementation.

    class UltimatePrimesSoE : IEnumerable<ulong> {
      static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1; const uint CHNKSZ = 17;
      const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[]
      //the 2,3,57 factorial wheel increment pattern, (sum) 48 elements long, starting at prime 19 position
      static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3,
                                                                               2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11;
      static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //to look up wheel indices from position index
      static readonly byte[] WHLRNDUP; //to look up wheel rounded up index positon values, allow for overfolw
      static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n);
      static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4;
      static readonly byte[] BWHLPRMS = { 2, 3, 5, 7, 11, 13, 17 }; const uint FSTBP = 19;
      static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
      static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC;
      static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //number of uints even number of caches in chunk
      static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page
      struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; }
      static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table
      static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes
      static int count(uint bitlim, ushort[] buf) { //very fast counting
        if (bitlim < BFRNG) { var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC;
          for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4)); }
        var acc = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
          acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc; }
      static void cull(ulong lwi, ushort[] b) { ulong nlwi = lwi;
        for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes.
        for (uint i = 0, pd = 0; ; ++i) { pd += (uint)baseprms[i] >> 6;
          var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi];
          var pp = (p - FSTBP) >> 1; var k = (ulong)p * (pp + ((FSTBP - 1) >> 1)) + pp;
          if (k >= nlwi) break; if (k < lwi) { k = (lwi - k) % (WCRC * p);
            if (k != 0) { var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
              if (nwp >= WCRC) wp = 0; else wp = nwp; } }
          else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
            var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      static Task cullbf(ulong lwi, ushort[] b, Action<ushort[]> f) {
        return Task.Factory.StartNew(() => { cull(lwi, b); f(b); }); }
      class Bpa {   //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array
        byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object();
        public uint this[uint i] { get { if (i >= this.sa.Length) lock (this.lck) {
                var lngth = this.sa.Length; while (i >= lngth) {
                  var bf = (ushort[])MCPY.Clone(); if (lngth == 0) {
                    for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
                        bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0 : wi) {
                      if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1;
                      if ((v & msk) == 0) { var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1;
                        if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
                        for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
                          var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } }
                  else { this.lwi += PGRNG; cull(this.lwi, bf); }
                  var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0);
                  for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
                      lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) {
                    if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) {
                      var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd; }
                  } this.sa = na; } } return this.sa[i]; } } }
      static readonly Bpa baseprms = new Bpa();
      static UltimatePrimesSoE() {
        WHLPOS = new byte[WHLPTRN.Length + 1]; //to look up wheel position index from wheel index
        for (byte i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; }
        WHLNDX = new byte[WCRC + 1]; for (byte i = 1; i < WHLPOS.Length; ++i) {
          for (byte j = (byte)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i; }
        WHLRNDUP = new byte[WCRC * 2]; for (byte i = 1; i < WHLRNDUP.Length; ++i) {
          if (i > WCRC) WHLRNDUP[i] = (byte)(WCRC + WHLPOS[WHLNDX[i - WCRC]]); else WHLRNDUP[i] = WHLPOS[WHLNDX[i]]; }
        Func<ushort, int> nmbts = (v) => { var acc = 0; while (v != 0) { acc += (int)v & 1; v >>= 1; } return acc; };
        CLUT = new byte[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1));
        PRLUT = new byte[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) {
          var t = (uint)(WHLPOS[i] * 2) + FSTBP; if (t >= WCRC) t -= WCRC; if (t >= WCRC) t -= WCRC; PRLUT[i] = (byte)t; }
        WSLUT = new Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) {
          var p = FSTBP + 2u * WHLPOS[x]; var pr = p % WCRC;
          for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) {
            var m = WHLPTRN[(x + y) % WHTS];
            pos %= WCRC; var posn = WHLNDX[pos]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC];
            WSLUT[x * WHTS + posn] = new Wst { msk = (ushort)(1 << (int)(posn & 0xF)), mlt = (byte)(m * WPC),
                                               xtr = (byte)(WPC * nposd + (nposn >> 4) - (posn >> 4)),
                                               nxt = (ushort)(WHTS * x + nposn) }; } }
        MCPY = new ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) { var p = (uint)lp;
          var k = (p * p - FSTBP) >> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) {
            var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      struct PrcsSpc { public Task tsk; public ushort[] buf; }
      class nmrtr : IEnumerator<ulong>, IEnumerator, IDisposable {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; ushort[] buf;
        public nmrtr() { for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] };
          for (var s = 1u; s < NUMPRCSPCS; ++s) {
            ps[s].tsk = cullbf((s - 1u) * BFRNG, ps[s].buf, (bfr) => { }); } buf = ps[0].buf; }
        ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v, msk = 0;
        public ulong Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
        public bool MoveNext() {
          if (b < 0) { if (b == -1) b += buf.Length; //no yield!!! so automatically comes around again
            else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)]; return true; } }
          do {
            i += WHLPTRN[wi++]; if (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) {
              if (++b >= BFSZ) { b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1];
                ps[NUMPRCSPCS - 1u].buf = buf;
                ps[NUMPRCSPCS - 1u].tsk = cullbf(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { });
                ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
          while ((v & msk) != 0u); _curr = FSTBP + i + i; return true; }
        public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
        public void Dispose() { } }
      public IEnumerator<ulong> GetEnumerator() { return new nmrtr(); }
      IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); }
      static void IterateTo(ulong top_number, Action<ulong, uint, ushort[]> actn) {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc {
          buf = new ushort[BFSZ], tsk = Task.Factory.StartNew(() => { }) };
        var topndx = (top_number - FSTBP) >> 1; for (ulong ndx = 0; ndx <= topndx; ) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx < nxtndx ? (uint)(topndx - ndx + 1) : BFRNG;
          ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx, buf, (b) => actn(lowi, lim, b)) };
          ndx = nxtndx; } for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait(); }
      public static long CountTo(ulong top_number) {
        if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count();
        var cnt = (long)BWHLPRMS.Length;
        IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); }); return cnt; }
      public static ulong SumTo(uint top_number) {
        if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p);
        var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p);
        Func<ulong, uint, ushort[], long> sumbf = (lowi, bitlim, buf) => {
          var acc = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim;
              i += WHLPTRN[wi++], wi = wi >= WHTS ? 0 : wi) {
            if (msk >= 0x8000) { msk = 1; v = buf[w++]; } else msk <<= 1;
            if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1)); } return acc; };
        IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); }); return (ulong)sum; }
      static void IterateUntil(Func<ulong, ushort[], bool> prdct) {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS];
        for (var s = 0u; s < NUMPRCSPCS; ++s) { var buf = new ushort[BFSZ];
          ps[s] = new PrcsSpc { buf = buf, tsk = cullbf(s * BFRNG, buf, (bfr) => { }) }; }
        for (var ndx = 0UL; ; ndx += BFRNG) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx; if (prdct(lowi, buf)) break;
          for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf,
                                             tsk = cullbf(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { }) }; } }
      public static ulong ElementAt(long n) {
        if (n < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)n);
        long cnt = BWHLPRMS.Length; var ndx = 0UL; var cycl = 0u; var bit = 0u; IterateUntil((lwi, bfr) => {
          var c = count(BFRNG, bfr); if ((cnt += c) < n) return false; ndx = lwi; cnt -= c; c = 0;
          do { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + CLUT[bfr[w]]; cnt += c; } while (cnt < n);
          cnt -= c; var y = (--cycl) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y];
          do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; } while (cnt <= n); --bit; return true;
        }); return FSTBP + ((ndx + cycl * WCRC + WHLPOS[bit]) << 1); } }
    

The above code takes about 59 milliseconds to find the primes to two million (slightly slower than some of the other simpler codes due to initialization overhead), but calculates the primes to one billion and the full number range in 1.55 and 5.95 seconds, respectively. This isn't much faster than the last version due to the DotNet extra overhead of an extra array bound check in the enumeration of found primes compared to the time expended in culling composite numbers which is less than a third of the time spent emumerating, so the saving in culling composites is cancelled out by the extra time (due to an extra array bounds check per prime candidate) in the enumeration. However, for many tasks involving primes, one does not need to enumerate all primes but can just compute the answers without enumeration.

For the above reasons, this class provides the example static methods "CountTo", "SumTo", and "ElementAt" to count or sum the primes to a given upper limit or to output the zero-based nth prime, respectively. The "CountTo" method will produce the number of primes to one billion and in the 32-bit number range in about 0.32 and 1.29 seconds, respectively; the "ElementAt" method will produce the last element in these ranges in about 0.32 and 1.25 seconds, respectively, and the "SumTo" method produces the sum of all the primes in these ranges in about 0.49 and 1.98 seconds respectively. This program calculates the sum of all the prime numbers to four billion plus as here in less time than many naive implementations can sum all the prime numbers to two million as in Euler Problem 10, for over 2000 times the practical range!

This code is only about four times slower than very highly optimized C code used by primesieve, and the reasons it is slower are mostly due to DotNet, as follows (discussing the case of a 256 Kilobyte buffer, which is the size of the L2 cache):

  1. Most of the execution time is spent in the main composite culling
    loop, which is the last "for loop" in the private static "cull"
    method" and only contains four statements per loop plus the range
    check.
  2. In DotNet, this compiles to take about 21.83 CPU clock cycles per
    loop, including about 5 clock cycles for the two array bounds
    checks per loop.
  3. The very efficient C compiler converts this loop into only about
    8.33 clock cycles for an advantage of about 2.67 times.
  4. Primesieve also uses extreme manual "loop unrolling" optimizations to
    reduce the average time to perform the work per loop to about 4.17
    clock cycles per composite cull loop, for a additional gain of two
    times and a total gain of about 5.3 times.
  5. Now, the highly optimized C code both doesn't Hyper Thread (HT) as
    well as the less efficient Just In Time (JIT) compiler produced code
    and as well, the OpemMP multi-threading used by primesieve doesn't
    appear to be as well adapted to this problem as use of the Thread
    Pool threads here, so the final multi-threaded gain in only about
    four times.
  6. One might consider the use of "unsafe" pointers to eliminate the
    array bounds checks, and it was tried, but the JIT compiler does not
    optimize pointers as well as normal array based code, so the gain of
    not having array bound checks is cancelled by the less efficient
    code (every pointer access (re)loads the pointer address from memory
    rather than using a register already pointing to that address as in
    the optimized array case).
  7. Primesieve is even faster when using smaller buffer sizes as in the
    size of the available L1 cache (16 Kilobytes when multi-threading
    for the i3/i5/i7 CPU's) as its more efficient code has more of an
    advantage reducing the average memory access time to one clock cycle
    from about four clock cycles, which advantage makes much less of a
    difference to the DotNet code, which gains more from less
    processing per reduced number of pages. Thus, primesieve is about
    five times faster when each use their most efficient buffer size.

This DotNet code will count (CountTo) the number of primes to ten to the power thirteen (ten trillion) in about an hour and a half (tested) and the number of primes to a hundred trillion (ten to the fourteenth) in just over a half day (estimated) as compared to twenty minutes and under four hours for primesieve, respectively. This is interesting historically as until 1985 only the count of primes in the range to ten to the thirteenth was known since it would take too much execution time on the (expensive) supercomputers of that day to find the range ten times larger; now we can easily compute the number of primes in those ranges on a common desktop computer (in this case, an Intel i7-2700K - 3.5 GHz)!

Using this code, it is easy to understand why Professor Atkin and Bernstein thought that the SoA was faster than the SoE - a myth that persists to this day, with the reasoning as follows:

  1. It is easy to have any SoA implementation count the number of state
    toggles and square free composite number culls (which last can be
    optimized using the same 2,3,5 wheel optimization as used inherently
    by the SoA algorithm) to determine that the total number of both of
    these operations is about 1.4 billion for the 32-bit number range.
  2. Bernstein's "equivalent" SoE implementation to his SoA
    implementation (neither of which are as optimized as this code),
    which uses the same 2,3,5 wheel optimization as the SoA, will have a
    total of about 1.82 billion cull operations with the same cull loop
    computational complexity.
  3. Therefore, Bernstein's results of about 30% better performance as
    compared to his implementation of the SoE is about right, just
    based on the number of equivalent operations. However, his
    implementation of the SoE did not take wheel factorization "to the
    max" since the SoA does not much respond to further degrees of wheel
    factorization as the 2,3,5 wheel is "baked in" to the basic
    algorithm.
  4. The wheel factorization optimizations used here reduces the number of
    composite cull operations to about 1.2 billion for the 32-bit number
    range; therefore, this algorithm using this degree of wheel
    factorization will run about 16.7% faster than an equivalent version
    of SoA since the culling loop(s) can be implemented about the same for
    each algorithm.
  5. The SoE with this level of optimizations is easier to write than an
    equivalent SoA as there needs only be one state table look up array
    to cull across the base primes instead of an additional state look
    up array's for each of the four quadratic equation solution cases
    that produce valid state toggles.
  6. When written using equivalent implementations as this code, the code
    will respond equivalently to C compiler optimizations for SoA as for
    the SoE used in primesieve.
  7. The SoA implementation will also respond to the extreme manual "loop
    unrolling" optimization used by primesieve just as effectively as
    for the SoE implementation.
  8. Therefore, if I went though all the work to implement the SoA
    algorithm using the techniques as for the above SoE code, the
    resulting SoA would only be a slight amount slower when the output
    primes were enumerated but would be about 16.7% slower when using
    the static direct methods
    .
  9. Memory footprint of the two algorithms is no different as both
    require the representation of the base primes and the same number of
    segment page buffers.

EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulong's. All of the above timings are for 64-bit mode. END_EDIT_ADD

In summary: The segment paged Sieve of Atkin is actually slower than a maximally optimized segment paged Sieve of Eratosthenes with no saving in memory requirements!!!

I say again: "Why use the Sieve of Atkin?".

絕版丫頭 2024-08-15 00:13:25

这是对提出的问题的更明确的回答,如下:

有办法(阿特金增量筛 - GBG)吗?

当然,阿特金筛法 (SoA) 可以实现为分段算法,实际上根本不需要使用建议的分段数组,而只需使用原始序列即可完成 我已经使用 F# 完成了功能,尽管这比使用 F# 所允许的额外效率要慢很多用于剔除的可变数组。

我想事情可能会是这样的:
1. 从一些琐碎的限制开始
2. 找出所有达到极限的素数
3. 找出所有达到极限的素数
4. 产生所有新发现的素数
5. 增加限制(通过将旧限制加倍或平方或类似的方式)
6. 转到步骤 2

上述算法至少可以通过两种不同的方式实现: 1) 当序列“运行”当前段并开始时,为“x”的每个值保存“x”和“y”的状态再次使用下一段的这些值,或者 2) 计算适用于新段的“x”和“y”对值。虽然第一种方法更简单,但我推荐第二种方法有两个原因:1)它不使用内存来存储必须保存的 x 和 y 的所有(许多)值,并且只必须保存基本素数的表示2)它为使用多线程并为每个页段分配独立的线程操作打开了大门,以便在多处理器计算机上节省大量时间。

事实上,有必要更好地理解“x”和“y”:

我的主要问题是我不太明白这个算法中的 x 和 y 是什么。比如,我可以使用相同的算法,但将 x 和 y 设置为 oldLimit (最初为 1)并运行到 newLimit 吗?

有一个答案解决了这个问题,但可能还不够明确。将这些二次方程视为潜在的无限序列可能更容易,其中“x”或“y”之一从其最低值开始固定,另一个变量生成每个序列的所有元素。例如,可以将二次方程“4*x^2 + y^2”的奇数表达式视为以 5, 17, 37, 65, ... 开头的序列序列,并且每个序列都有元素如 {5, 13, 29, 53, ...}, {17, 25, 41, 65, ...}, {37, 45, 61, 85, ...}, {65, 73, 89, 125, ...}, ... 显然,其中一些元素不是素数,因为它们是 3 或 5 的复合,这就是为什么这些元素需要通过模测试或伯恩斯坦中的方法来消除在实现中,可以通过识别生成器的模算术中的模式来自动跳过这些,因此它们实际上永远不会出现。

实现第一种更简单的方法来生成 SoA 的分段版本,只需要保存每个序列序列的状态,这基本上就是 F# 增量实现中所做的事情(尽管使用折叠树结构来提高效率),这可以很容易地适应在数组页面范围内工作。对于在每个段页面的开头计算序列状态的情况,只需计算每个“活动”序列的新段页面中最低元素所表示的数量为止,有多少元素可以放入空间中,其中“active”表示起始元素低于段页起始索引所表示的数字的序列。

至于如何实现基于数组的 SoA 筛分的伪代码,我已经为 相关帖子,其中展示了如何实现这一点。

这样做的目的是让我不必设置该限制。这样我就可以使用 Linq 并只需要 Take() 无论我需要多少素数,而不用担心限制是否足够高等等。

正如另一个答案中所述,您可以通过在代码中将最大“限制”设置为常量来实现此最终目标,但这对于小范围的素数来说效率相当低,因为剔除将在比该数组大得多的数组上进行必要的。如前所述,除了提高效率和大幅减少内存使用之外,分段还具有允许有效使用多处理的其他优点。然而,使用 Take()、TakeWhile()、Where()、Count() 等方法不会为大范围的素数提供非常好的性能,因为它们的使用涉及在许多时钟周期对每个元素进行许多堆栈方法调用每次调用和返回。但是您可以选择使用这些或更命令式的程序流形式,因此这并不是真正的反对意见。

This is a more explicit answer to the question(s) raised, as follows:

Is there a way (the incremental Sieve of Atkin - GBG) can be done?

Of course the Sieve of Atkin (SoA) can be implemented as a segmented algorithm and in fact does not require the use of segmenting arrays as suggested at all but rather can be done just using raw sequences as I have done functionally using F#, although this is quite a bit slower than the extra efficiency permitted by using mutable arrays for the culling.

I am thinking it could go something like this:
1. Start with some trivial limit
2. Find all the primes up to the limit
3. Find all the primes up to the limit
4. Yield all the newfound primes
5. Increase the limit (by doubling or squaring the old limit or something like that)
6. Goto step 2

The above algorithm can be implemented in at least two different ways: 1) save the state of 'x' and 'y' for each value of 'x' when the sequences "run off" the current segment and start up again with those values for the next segment, or 2) calculate the applicable pair values of 'x' and 'y' to use for the new segment. Although the first way is simpler, I recommend the second method for two reasons: 1) it doesn't use memory for all the (many) values of x and y that must be saved and only a representation of the base primes must be saved in memory for the "squares free" culling step, and 2) it opens the door to using multi-threading and assigning independent thread operations for each page segment for large savings in time on a multi-processor computer.

And indeed, a better understanding of 'x' and 'y' is necessary:

My main problem is that I don't quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit?

There has been one answer addressing this but perhaps it isn't explicit enough. It may be easier to think of these quadratic equations as a potentially infinite sequence of sequences where one of 'x' or 'y' is fixed starting from their lowest values and the other variable produces all of the elements per sequence. For instance, one could consider the odd expression of the quadratic equation "4*x^2 + y^2" as the sequence of sequences starting with 5, 17, 37, 65, ..., and with each of these sequences have elements as in {5, 13, 29, 53, ...}, {17, 25, 41, 65, ...}, {37, 45, 61, 85, ...}, {65, 73, 89, 125, ...}, ... Obviously, some of these elements are not prime as they are composites of 3 or 5 and that is why these need to be eliminated by either the modulo test, or alternatively as in the Bernstein implementation, these can be skipped automatically by recognizing patterns in the modulo arithmetic for the generators so they never actually even appear.

Implementing the first easier way of producing a segmented version of the SoA then just requires saving the state of each of the sequence of sequences, which is basically what is done in the F# incremental implementation (although using a folded tree structure for efficiency), which could easily be adapted to working over an array page range. For the case where the sequence state is computed at the beginning of every segment page, one just needs to compute how many elements would fit into the space up to the number represented by the lowest element in the new segment page for each "active" sequence, where "active" means a sequence whose starting element is lower than the number represented by the start index of the segment page.

As for pseudo-code on how to implement the segmentation of an array based SoA sieve, I have written something up for a related post, which shows how this can be accomplished.

The point of this is so that I won't have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

As stated in another answer, you could accomplish this end goal by just setting the maximum "limit" as a constant in your code, but this would be quite inefficient for small ranges of primes as culling would be taking place over a much larger array than necessary. As already stated, other than for better efficiency and reducing memory use by a huge factor, segmentation also has other advantages in permitting the efficient use of multi-processing. However, use of the Take(), TakeWhile(), Where(), Count(), etcetera, methods won't provide very good performance for large ranges of primes as their use involves many stacked method calls per element at many clock cycles per call and return. But you would have the option of either using these or more imperative forms of program flow, so this isn't a real objection.

山川志 2024-08-15 00:13:25

我可以尝试解释 x 和 y 的作用,但我认为如果不从头开始重新启动循环,您就无法执行您所要求的操作。这对于任何“筛子”算法来说几乎都是一样的。

筛子的作用基本上是计算有多少个不同的二次方程(偶数或奇数)将每个数字作为解。根据 n % 12 的不同,针对每个数字检查的具体方程也有所不同。

例如,模 12 余数为 1 或 5 的数字 n 是素数,当且仅当解数为 4*x^2+y ^2=n 是奇数且无平方数。第一个循环只是循环遍历可以满足这些不同方程的所有可能的 x 和 y 值。每次我们找到该 n 的解时,通过翻转 isPrime[n],我们可以跟踪解的数量是奇数还是偶数。

问题是我们同时对所有可能的 n 进行计数,这比一次检查一个要高效得多。仅执行一些 n 会花费更多时间(因为您必须确保第一个循环中的 n >= lower_limit)并使第二个循环变得复杂,因为该循环需要了解所有素数小于 sqrt。

第二个循环检查该数字是否无平方(没有因数是素数的平方)。

另外,我认为您对阿特金筛的实施不一定比直接的埃拉托色尼筛更快。然而,最快的、最优化的阿特金筛将击败最快的、最优化的埃拉托色尼筛。

I can try to explain what x and y does, but I don't think you can do what you ask without restarting the loops from the beginning. This is pretty much the same for any "sieve"-algorithm.

What the sieve does is basically count how many different quadratic equations (even or odd) have each number as a solution. The specific equation checked for each number is different depending on what n % 12 is.

For example, numbers n which have a mod 12 remainder of 1 or 5 are prime if and only if the number of solutions to 4*x^2+y^2=n is odd and the number is square-free. The first loop simply loops through all possible values of x and y that could satisfy these different equations. By flipping the isPrime[n] each time we find a solution for that n, we can keep track of whether the number of solutions is odd or even.

The thing is that we count this for all possible n at the same time, which makes this much more efficient than checking one at the time. Doing it for just some n would take more time (because you would have to make sure that n >= lower_limit in the first loop) and complicate the second loop, since that one requires knowledge of all primes smaller than sqrt.

The second loop checks that the number is square-free (has no factor which is a square of a prime number).

Also, I don't think your implementation of the sieve of Atkin is necessarily faster than a straight-forward sieve of Eratosthenes would be. However, the fastest possible most optimized sieve of Atkin would beat the fastest possible most optimized sieve of Eratosthenes.

百合的盛世恋 2024-08-15 00:13:25

是的,阿特金筛可以增量。请参阅论文 “两个紧凑增量素筛”,作者:Jonathan Sorenson

Yes, the sieve of Atkin can be made incremental. See the paper "Two compact incremental prime sieves" by Jonathan Sorenson.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文