Non-linear scaling of .NET operations on a multi-core machine
I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.
When run on a multi-core processor (Intel Core 2 Quad Q6600 @ 2.4 GHz) it exhibits non-linear scaling as multiple threads are kicked off to process the data.
When run as a non-multithreaded loop on a single core, the process is able to complete approximately 2.4 million computations per second. When run as four threads you would expect four times the throughput - somewhere in the neighborhood of 9 million computations per second - but alas, no. In practice it only completes about 4.1 million per second ... quite a bit short of the expected throughput.
Furthermore, the behavior occurs no matter whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...
Nothing else on the machine is using CPU time, nor are there any locks or other synchronization objects involved in the computation ... it should just tear through the data. I've confirmed this (to the extent possible) by looking at perfmon data while the process runs ... and there is no reported thread contention or garbage collection activity.
My theories at the moment:
- The overhead of all of the techniques (thread context switches, etc.) is overwhelming the computations
- The threads are not getting assigned to each of the four cores and spend some time waiting on the same processor core ... not sure how to test this theory (one way to probe it is sketched after this list)
- .NET CLR threads are not running at the priority expected or have some hidden internal overhead.
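One way to probe the core-assignment theory (a rough sketch of my own, not from the original post; the helper name and the kernel32 P/Invoke usage are assumptions) is to pin each explicit worker thread to a distinct core and compare the pinned run against the default scheduling:

using System;
using System.Runtime.InteropServices;
using System.Threading;

static class CoreAffinity
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();                  // pseudo-handle for the calling thread

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr mask);

    // Pin the calling thread to a single core (0-based index).
    public static void PinCurrentThreadToCore(int coreIndex)
    {
        Thread.BeginThreadAffinity();                         // keep the managed thread on this OS thread
        SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1u << coreIndex));
    }
}

If each of the four explicit threads calls CoreAffinity.PinCurrentThreadToCore with indices 0 through 3 as its first statement and the throughput does not change, then core assignment is probably not the bottleneck.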
Below is a representative excerpt from the code that should exhibit the same behavior:
var evaluator = new LookupBasedEvaluator();
// find all ten-vertex polygons that are a subset of the set of points
var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);
const int TEST_SIZE = 10000000; // evaluate the first 10 million records
// materialize the data into memory...
var polygons = ssg.AsParallel()
.Take(TEST_SIZE)
.Cast<PolygonData>()
.ToArray();
var sw1 = Stopwatch.StartNew();
// for loop completes in about 4.02 seconds... ~ 2.483 million/sec
foreach( var polygon in polygons )
evaluator.Evaluate(polygon);
sw1.Stop();
Console.WriteLine( "Linear, single core loop: {0}", sw1.ElapsedMilliseconds );
// now attempt the same thing in parallel using Parallel.ForEach...
// MS documentation indicates this internally uses a worker thread pool
// completes in 2.61 seconds ... or ~ 3.831 million/sec
var sw2 = Stopwatch.StartNew();
Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
sw2.Stop();
Console.WriteLine( "Parallel.ForEach() loop: {0}", s2.ElapsedMilliseconds );
// now using PLINQ, er get slightly better results, but not by much
// completes in 2.21 seconds ... or ~ 4.524 million/second
var sw3 = Stopwatch.StartNew();
polygons.AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .AsUnordered() // not sure this is necessary...
    .ForAll( h => evaluator.Evaluate(h) );
sw3.Stop();
Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", s3.EllapsedMilliseconds );
// now using four explicit threads:
// best, still short of expectations at 1.99 seconds = ~ 5 million/sec
ParameterizedThreadStart tsd = delegate(object pset) { foreach (var p in (IEnumerable<PolygonData>) pset) evaluator.Evaluate(p); };
var t1 = new Thread(tsd);
var t2 = new Thread(tsd);
var t3 = new Thread(tsd);
var t4 = new Thread(tsd);
var sw4 = Stopwatch.StartNew();
t1.Start(polygons);
t2.Start(polygons);
t3.Start(polygons);
t4.Start(polygons);
t1.Join();
t2.Join();
t3.Join();
t4.Join();
sw4.Stop();
Console.WriteLine( "Four Explicit Threads: {0}", sw4.ElapsedMilliseconds );
4 Answers
So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.
The entire issue with non-linear performance was the result of a single line inside the Evaluate() method: a local array allocation. Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR internally performs some inter-thread synchronization when allocating memory - otherwise allocations on multiple threads could inadvertently overlap. Changing the array from a method-local instance to a class-level instance that is allocated only once (but then initialized in a method-local loop) eliminated the scalability problem.

Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.
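A minimal before/after sketch of the change (the class names, buffer name, buffer size, and the stand-in PolygonData type below are hypothetical placeholders; the answer does not show the actual body of Evaluate()):

using System;

// Hypothetical stand-in for the real data type, which is not shown in the question.
public class PolygonData { public int VertexCount = 10; }

// BEFORE (sketch): a scratch array is allocated on every call, so millions of
// calls mean millions of short-lived heap allocations across the worker threads.
public class EvaluatorAllocatingPerCall
{
    public int Evaluate(PolygonData polygon)
    {
        var scratch = new int[64];                   // allocated on every call
        for (int i = 0; i < scratch.Length; i++)
            scratch[i] = i * polygon.VertexCount;
        return scratch[scratch.Length - 1];
    }
}

// AFTER (sketch): the buffer becomes a class-level field, allocated once per
// evaluator instance and merely re-initialized inside the method-local loop.
public class EvaluatorReusingBuffer
{
    private readonly int[] scratch = new int[64];    // allocated once

    public int Evaluate(PolygonData polygon)
    {
        for (int i = 0; i < scratch.Length; i++)     // reset instead of reallocate
            scratch[i] = i * polygon.VertexCount;
        return scratch[scratch.Length - 1];
    }
}

Note that reusing an instance field this way assumes each thread works with its own evaluator instance (or that the buffer is otherwise kept thread-local); a single instance shared across threads would turn the reused buffer into a data race.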
Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.
P.S. Kudos to Igor Ostrovsky for the germane MSDN blog link, which helped me identify and diagnose the problem.
Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx
Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write.
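As an illustration of the second point (a sketch with made-up names, not code from the linked article): when several threads repeatedly write to slots that sit next to each other in one small array, those slots share a cache line and the cores keep invalidating each other's copies; letting each thread accumulate into a local variable (or into widely spaced slots) avoids that.

using System;
using System.Threading;

static class FalseSharingSketch
{
    const int Iterations = 50000000;

    // Prone to false sharing: adjacent longs written by different threads
    // live on the same cache line and ping-pong between cores.
    static void AdjacentCounters(int threadCount)
    {
        var counters = new long[threadCount];
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int id = t;
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < Iterations; i++)
                    counters[id]++;                   // neighboring slots
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
    }

    // Friendlier: each thread accumulates locally and publishes one result
    // into a slot spaced well over a cache line away from its neighbors.
    static void LocalCounters(int threadCount)
    {
        var results = new long[threadCount * 16];     // 16 longs = 128 bytes apart
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int id = t;
            threads[t] = new Thread(() =>
            {
                long local = 0;
                for (int i = 0; i < Iterations; i++)
                    local++;
                results[id * 16] = local;             // single, widely spaced write
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
    }
}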
Non-linear scaling is to be expected with a parallel algorithm in comparison with a sequential algorithm, since there is some inherent overhead in the parallelization. (Ideally, of course, you want to get as close to linear as you can.)
Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:
As far as I can tell, your current explicit approach uses a shared iterator between the threads. That's an okay solution if the processing time varies wildly across the array, but it is likely to incur synchronization overhead, since retrieving the current element and moving the internal pointer to the next element has to be an atomic operation so that no element is skipped.
Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of its position. Given that you have 10 million records, that means telling thread 1 to work on elements 0 to 2,499,999, thread 2 to work on elements 2,500,000 to 4,999,999, and so on. You can assign each thread an ID and use it to calculate the actual range; a sketch of this partitioning appears below.
Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.
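A minimal sketch of that range partitioning, reusing the polygons array and evaluator from the question's code (the partition arithmetic is illustrative; the main thread takes the last range, including any remainder):

using System;
using System.Threading;

int threadCount = Environment.ProcessorCount;         // e.g. 4 on the Q6600
int chunk = polygons.Length / threadCount;
var workers = new Thread[threadCount - 1];

for (int t = 0; t < threadCount - 1; t++)
{
    int start = t * chunk;                             // each worker gets a fixed range
    int end = start + chunk;                           // exclusive upper bound
    workers[t] = new Thread(() =>
    {
        for (int i = start; i < end; i++)
            evaluator.Evaluate(polygons[i]);
    });
    workers[t].Start();
}

// The main thread acts as the last worker and also picks up the remainder.
for (int i = (threadCount - 1) * chunk; i < polygons.Length; i++)
    evaluator.Evaluate(polygons[i]);

foreach (var w in workers) w.Join();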
I certainly would not expect a linear relationship, but I would have thought you would have seen a bigger gain than that. I am assuming that the CPU usage is maxed out on all cores. Just a couple of thoughts off the top of my head.
Edit: Sorry, I just noticed that you have already addressed both of my points.