How to implement the Sieve of Eratosthenes using multithreaded C#?
I am trying to implement the Sieve of Eratosthenes using multithreading. Here is my implementation:
using System;
using System.Collections.Generic;
using System.Threading;
namespace Sieve_Of_Eratosthenes
{
class Controller
{
public static int upperLimit = 1000000000;
public static bool[] primeArray = new bool[upperLimit];
static void Main(string[] args)
{
DateTime startTime = DateTime.Now;
Initialize initial1 = new Initialize(0, 249999999);
Initialize initial2 = new Initialize(250000000, 499999999);
Initialize initial3 = new Initialize(500000000, 749999999);
Initialize initial4 = new Initialize(750000000, 999999999);
initial1.thread.Join();
initial2.thread.Join();
initial3.thread.Join();
initial4.thread.Join();
int sqrtLimit = (int)Math.Sqrt(upperLimit);
Sieve sieve1 = new Sieve(249999999);
Sieve sieve2 = new Sieve(499999999);
Sieve sieve3 = new Sieve(749999999);
Sieve sieve4 = new Sieve(999999999);
for (int i = 3; i < sqrtLimit; i += 2)
{
if (primeArray[i] == true)
{
int squareI = i * i;
if (squareI <= 249999999)
{
sieve1.set(i);
sieve2.set(i);
sieve3.set(i);
sieve4.set(i);
sieve1.thread.Join();
sieve2.thread.Join();
sieve3.thread.Join();
sieve4.thread.Join();
}
else if (squareI > 249999999 & squareI <= 499999999)
{
sieve2.set(i);
sieve3.set(i);
sieve4.set(i);
sieve2.thread.Join();
sieve3.thread.Join();
sieve4.thread.Join();
}
else if (squareI > 499999999 & squareI <= 749999999)
{
sieve3.set(i);
sieve4.set(i);
sieve3.thread.Join();
sieve4.thread.Join();
}
else if (squareI > 749999999 & squareI <= 999999999)
{
sieve4.set(i);
sieve4.thread.Join();
}
}
}
int count = 0;
primeArray[2] = true;
for (int i = 2; i < upperLimit; i++)
{
if (primeArray[i])
{
count++;
}
}
Console.WriteLine("Total: " + count);
DateTime endTime = DateTime.Now;
TimeSpan elapsedTime = endTime - startTime;
Console.WriteLine("Elapsed time: " + elapsedTime.TotalSeconds);
}
public class Initialize
{
public Thread thread;
private int lowerLimit;
private int upperLimit;
public Initialize(int lowerLimit, int upperLimit)
{
this.lowerLimit = lowerLimit;
this.upperLimit = upperLimit;
thread = new Thread(this.InitializeArray);
thread.Priority = ThreadPriority.Highest;
thread.Start();
}
private void InitializeArray()
{
for (int i = this.lowerLimit; i <= this.upperLimit; i++)
{
if (i % 2 == 0)
{
Controller.primeArray[i] = false;
}
else
{
Controller.primeArray[i] = true;
}
}
}
}
public class Sieve
{
public Thread thread;
public int i;
private int upperLimit;
public Sieve(int upperLimit)
{
this.upperLimit = upperLimit;
}
public void set(int i)
{
this.i = i;
thread = new Thread(this.primeGen);
thread.Start();
}
public void primeGen()
{
for (int j = this.i * this.i; j <= this.upperLimit; j += i)
{
Controller.primeArray[j] = false;
}
}
}
}
}
This takes 30 seconds to produce the output. Is there any way to speed this up?
Edit:
Here is the TPL implementation:
public LinkedList<int> GetPrimeList(int limit) {
LinkedList<int> primeList = new LinkedList<int>();
bool[] primeArray = new bool[limit];
Console.WriteLine("Initialization started...");
Parallel.For(0, limit, i => {
if (i % 2 == 0) {
primeArray[i] = false;
} else {
primeArray[i] = true;
}
}
);
Console.WriteLine("Initialization finished...");
/*for (int i = 0; i < limit; i++) {
if (i % 2 == 0) {
primeArray[i] = false;
} else {
primeArray[i] = true;
}
}*/
int sqrtLimit = (int)Math.Sqrt(limit);
Console.WriteLine("Operation started...");
Parallel.For(3, sqrtLimit, i => {
lock (this) {
if (primeArray[i]) {
for (int j = i * i; j < limit; j += i) {
primeArray[j] = false;
}
}
}
}
);
Console.WriteLine("Operation finished...");
/*for (int i = 3; i < sqrtLimit; i += 2) {
if (primeArray[i]) {
for (int j = i * i; j < limit; j += i) {
primeArray[j] = false;
}
}
}*/
//primeList.AddLast(2);
int count = 1;
Console.WriteLine("Counting started...");
Parallel.For(3, limit, i => {
lock (this) {
if (primeArray[i]) {
//primeList.AddLast(i);
count++;
}
}
}
);
Console.WriteLine("Counting finished...");
Console.WriteLine(count);
/*for (int i = 3; i < limit; i++) {
if (primeArray[i]) {
primeList.AddLast(i);
}
}*/
return primeList;
}
Thank you.
Comments (3)
Edited:
My answer to the question is: yes, you can definitely use the Task Parallel Library (TPL) to find the primes to one billion faster. The code given in the question is slow because it doesn't use memory or multiprocessing efficiently, and its final output stage isn't efficient either.
So, other than just multiprocessing, there are a large number of things you can do to speed up the Sieve of Eratosthenes, as follows:
1. Your sieve uses an array of bool, which takes one byte per candidate (one billion bytes for your range of one billion) and is slower due to the unnecessary processing. Because two is the only even prime, making the array represent only odd candidates would halve the memory requirement and reduce the number of composite number cull operations by over a factor of two, so that the operation might take something like 20 seconds on your machine for primes to a billion.
2. Part of the reason that such a huge memory array is so slow is that it greatly exceeds the CPU cache sizes, so that many memory accesses go to main memory in a somewhat random fashion. Culling a given composite number representation can then take over a hundred CPU clock cycles, whereas if the data were all in the L1 cache it would take only one cycle, and in the L2 cache only about four cycles; not all accesses take the worst-case times, but this definitely slows the processing. Using a bit-packed array to represent the prime candidates will reduce memory use by a factor of eight and make the worst-case accesses less common. While there is a computational overhead to accessing individual bits, there is a net gain because the time saved by reducing the average memory access time is greater than this cost. The simple way to implement this is to use a BitArray rather than an array of bool. Writing your own bit accesses using shift and "and" operations will be more efficient than using the BitArray class. You will find a slight saving using BitArray and another factor of two doing your own bit operations, for a single-threaded performance of perhaps about ten or twelve seconds with this change.
3. Counting the primes the way your code does requires an array access and an if condition per candidate prime. Once you have the sieve buffer as a bit-packed array of words, you can do this much more efficiently with a counting Look Up Table (LUT), which eliminates the if condition and needs only two array accesses per bit-packed word. Doing this, the time to count becomes a negligible part of the work compared to the time to cull composite numbers, for a further saving to get down to perhaps eight seconds for the count of the primes to one billion.
4. A further reduction in the number of processed candidates can be the result of applying wheel factorization, which removes, say, the factors of the primes 2, 3, and 5 from the processing and, by adjusting the method of bit packing, can also increase the effective range of a given size of bit buffer by a further factor of about two. This can reduce the number of composite number culling operations by another huge factor of up to over three times, although at the cost of further computational complexity.
5. To make memory use even more efficient, and to prepare the way for multiprocessing per page segment, one can divide the work into pages that are no larger than the L1 or L2 cache sizes. This requires keeping a base primes table of all the primes up to the square root of the maximum prime candidate and recomputing the starting address parameters of each of the base primes used in culling across a given page segment, but this is still more efficient than using huge culling arrays. An added benefit of page segmentation is that one does not have to specify the upper sieving limit in advance but can just extend the base primes as necessary as further upper pages are processed. With all of the optimizations to this point, you can likely produce the count of primes up to one billion in about 2.5 seconds.
6. Finally, one can multiprocess the page segments using TPL or Threads, which, using a buffer size of about the L2 cache size (per core), will produce an additional gain of a factor of two on your dual-core, non-Hyper-Threaded (HT) older processor, the Intel E7500 Core2Duo, for an execution time to find the number of primes to one billion of about 1.25 seconds or so.
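Steps 1 to 3 above can be sketched in a few lines. This is only a minimal single-threaded illustration, not the code this answer refers to: the names `OddBitSieve`, `CountPrimes`, and `BitCount`, and the choice of a 16-bit counting table, are assumptions made for the example.

```csharp
using System;

public static class OddBitSieve
{
    // Hypothetical 16-bit counting LUT: BitCount[x] is the number of set bits in x.
    static readonly byte[] BitCount = BuildTable();

    static byte[] BuildTable()
    {
        var t = new byte[65536];
        for (int i = 1; i < t.Length; i++) t[i] = (byte)(t[i >> 1] + (i & 1));
        return t;
    }

    // Counts the primes <= limit. Bit k of the buffer stands for the odd
    // number 2*k + 3, and a set bit means "composite".
    public static long CountPrimes(uint limit)
    {
        if (limit < 2) return 0;
        if (limit < 3) return 1;
        uint topIndex = (limit - 3) / 2;          // index of the largest odd candidate
        uint[] buf = new uint[topIndex / 32 + 1]; // 32 candidates per word

        for (uint i = 0; ; i++)
        {
            ulong p = 2UL * i + 3;
            ulong start = (p * p - 3) / 2;        // bit index of p * p
            if (start > topIndex) break;
            if ((buf[i >> 5] & (1u << (int)(i & 31))) != 0) continue; // p is composite
            // Stepping the index by p steps the number by 2 * p (odd multiples only).
            for (ulong j = start; j <= topIndex; j += p)
                buf[j >> 5] |= 1u << (int)(j & 31);
        }

        // Mark the unused bits of the last word so they are not counted as primes.
        for (ulong k = (ulong)topIndex + 1; k < (ulong)buf.Length * 32; k++)
            buf[k >> 5] |= 1u << (int)(k & 31);

        long count = 1;                           // the prime 2
        foreach (uint w in buf)                   // two LUT accesses per word, no branch
            count += 32 - BitCount[w & 0xFFFF] - BitCount[w >> 16];
        return count;
    }
}
```

Calling `OddBitSieve.CountPrimes(1000000000)` needs about 62.5 MB for the bit buffer instead of one billion bytes. The buffer is still far larger than the CPU caches, which is what steps 5 and 6 address.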
I have implemented a multi-threaded Sieve of Eratosthenes as an answer to another thread to show that there isn't any advantage to the Sieve of Atkin over the Sieve of Eratosthenes. It uses the Task Parallel Library (TPL), as in Tasks and TaskFactory, so it requires at least DotNet Framework 4. I have further tweaked that code using all of the optimizations discussed above as an alternate answer to the same question. I re-post that tweaked code here with added comments and easier-to-read formatting, as follows:
The above code will enumerate the primes to one billion in about 1.55 seconds on a four-core (eight threads including HT) i7-2700K (3.5 GHz), and your E7500 will be perhaps up to four times slower due to fewer threads and a slightly lower clock speed. About three quarters of that time is just the time to run the enumeration MoveNext() method and Current property, so I provide the public static methods "CountTo", "SumTo" and "ElementAt" to compute the number or sum of primes in a range and the nth zero-based prime, respectively, without using enumeration. Using the UltimatePrimesSoE.CountTo(1000000000) static method produces 50847534 in about 0.32 seconds on my machine, so it shouldn't take longer than about 1.28 seconds on the Intel E7500.
EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulong's. All of the above timings are for 64-bit mode. END_EDIT_ADD
At almost 300 (dense) lines of code, this implementation isn't simple, but that's the cost of doing all of the described optimizations that make this code so efficient. It isn't all that many more lines of code than the other answer by Aaron Murgatroyd; although his code is less dense, it is also about four times as slow. In fact, almost all of the execution time is spent in the final "for loop" of my code's private static "cullbf" method, which is only four statements long plus the range condition check; all the rest of the code is just in support of repeated applications of that loop.
This code is faster than that other answer for the same reasons that it is faster than your code, except that he does apply the Step (1) optimization of processing only odd prime candidates. His use of multiprocessing is almost completely ineffective, giving only a 30% advantage rather than the factor of four that should be possible on a true four-core CPU when applied correctly, because it threads per prime rather than for all primes over small pages. His use of unsafe pointer array access to eliminate the DotNet cost of an array bounds check per loop actually slows the code compared to just using arrays directly, bounds checks included, because the DotNet Just In Time (JIT) compiler produces quite inefficient code for pointer access. In addition, his code enumerates the primes just as my code can do, and enumeration costs tens of CPU clock cycles per enumerated prime; this is slightly worse in his case because he uses the built-in C# iterators, which are somewhat less efficient than my "roll-your-own" IEnumerator interface. For maximum speed we should avoid enumeration entirely, yet even his supplied "Count" instance method uses a "foreach" loop, which means enumeration.
In summary, this answer's code produces prime answers about 25 times faster than your question's code on your E7500 CPU (many more times faster on a CPU with more cores/threads), uses much less memory, and is not limited to smaller prime ranges of about the 32-bit number range, but at the cost of increased code complexity.
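As a rough illustration of steps 5 and 6 (page segmentation plus one task per page), here is a short sketch. It is not the UltimatePrimesSoE code discussed in this answer: it omits bit packing and wheel factorization, and `SegmentedSieve`, `Worker`, and the page size `SegmentBits` are hypothetical names and values chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class SegmentedSieve
{
    // Assumed page size: 128 Ki odd candidates (one byte each); tune to the cache size.
    const int SegmentBits = 1 << 17;

    // Per-thread state: a reusable page buffer and a running count.
    sealed class Worker
    {
        public readonly bool[] Page = new bool[SegmentBits];
        public long Count;
    }

    // Plain odd-only sieve for the base primes up to max.
    static List<int> BasePrimes(int max)
    {
        var composite = new bool[max + 1];
        var primes = new List<int>();
        for (int p = 3; p <= max; p += 2)
        {
            if (composite[p]) continue;
            primes.Add(p);
            for (long j = (long)p * p; j <= max; j += 2 * p) composite[j] = true;
        }
        return primes;
    }

    // Counts the primes <= limit; index k stands for the odd number 2*k + 3.
    public static long CountPrimes(long limit)
    {
        if (limit < 2) return 0;
        if (limit < 3) return 1;
        long topIndex = (limit - 3) / 2;
        List<int> basePrimes = BasePrimes((int)Math.Sqrt((double)limit) + 1);
        long total = 1;                                   // the prime 2

        Parallel.For(0L, topIndex / SegmentBits + 1,
            () => new Worker(),
            (seg, loopState, w) =>
            {
                long lo = seg * SegmentBits;
                long hi = Math.Min(lo + SegmentBits - 1, topIndex);
                int len = (int)(hi - lo + 1);
                Array.Clear(w.Page, 0, len);
                foreach (int p in basePrimes)
                {
                    long start = ((long)p * p - 3) / 2;   // index of p * p
                    if (start > hi) break;
                    if (start < lo)
                    {
                        // Recompute the first multiple of p that falls in this page.
                        long r = (lo - start) % p;
                        start = r == 0 ? lo : lo + p - r;
                    }
                    for (long j = start - lo; j < len; j += p) w.Page[j] = true;
                }
                for (int k = 0; k < len; k++) if (!w.Page[k]) w.Count++;
                return w;
            },
            w => Interlocked.Add(ref total, w.Count));    // merge per-thread counts once
        return total;
    }
}
```

Each task touches only its own page buffer and reads the shared base primes list, so no locking is needed inside the culling loop; the only synchronization is the single `Interlocked.Add` per worker at the end.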
My implementation with multi threading (.NET 4.0 required):
The multi-threading works by threading the innermost loop. This way there are no data locking issues, because the multiple threads each work with a subset of the array and don't overlap for each job done.
It seems to be quite fast: it can generate all primes up to a limit of 1,000,000,000 on an AMD Phenom II X4 965 processor in 5.8 seconds. Special implementations like the Sieve of Atkin are faster, but this is fast for the Sieve of Eratosthenes.
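The idea of threading the innermost loop can be sketched as follows. This is a minimal illustration of the technique as described, not the implementation this answer refers to; the class name `InnerLoopParallelSieve` and the `chunks` parameter are assumptions made for the example.

```csharp
using System;
using System.Threading.Tasks;

public static class InnerLoopParallelSieve
{
    // Counts the primes <= limit. For each prime p, the multiples to cull are
    // split into `chunks` disjoint ranges that are culled in parallel.
    public static int CountPrimes(int limit, int chunks = 4)
    {
        if (limit < 2) return 0;
        var composite = new bool[limit + 1];
        for (long p = 3; p * p <= limit; p += 2)
        {
            if (composite[p]) continue;
            long first = p * p, step = 2 * p;             // odd multiples only
            long steps = (limit - first) / step + 1;      // number of multiples to cull
            long perChunk = (steps + chunks - 1) / chunks;
            Parallel.For(0, chunks, c =>
            {
                // Each chunk writes only its own slice of the array, so no locks are needed.
                long from = first + c * perChunk * step;
                long to = Math.Min(limit, from + (perChunk - 1) * step);
                for (long j = from; j <= to; j += step) composite[j] = true;
            });
        }
        int count = 1;                                    // the prime 2
        for (int n = 3; n <= limit; n += 2) if (!composite[n]) count++;
        return count;
    }
}
```

Because a parallel loop is started for every prime, the threads synchronize once per prime and still sweep the whole array each time, which is the per-prime threading that the first answer says gives only a limited gain.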
A while back I tried to implement the Sieve of Atkin in parallel. It was a failure. I haven't done any deeper research, but it seems that both the Sieve of Eratosthenes and the Sieve of Atkin are hard to scale over multiple CPUs because the implementations I've seen use a shared list of integers. Shared state is a heavy anchor to carry when you try to scale over multiple CPUs.