多核机器上 .NET 操作的非线性扩展

发布于 2024-08-05 09:42:36 字数 2931 浏览 8 评论 0原文

我在 .NET 应用程序中遇到了一种奇怪的行为,该应用程序对一组内存数据执行一些高度并行的处理。

当在多核处理器(IntelCore2 Quad Q6600 2.4GHz)上运行时,它会在启动多个线程来处理数据时表现出非线性扩展。

当在单核上作为非多线程循环运行时,该进程每秒能够完成大约 240 万次计算。当作为四个线程运行时,您会期望四倍的吞吐量 - 大约每秒 900 万次计算 - 但遗憾的是,没有。实际上,它每秒只能完成大约 410 万个数据……与预期吞吐量相比还差很多。

此外,无论我使用 PLINQ、线程池还是四个显式创建的线程,都会发生该行为。很奇怪...

机器上没有使用 CPU 时间运行其他任何东西,计算中也没有涉及任何锁或其他同步对象...它应该只是提前处理数据。我已经通过在进程运行时查看 perfmon 数据来确认这一点(在可能的范围内)......并且没有报告线程争用或垃圾收集活动。

我目前的理论:

  1. 所有技术(线程上下文切换等)的开销压倒了计算
  2. 线程没有被分配给四个核心中的每一个,并且花费一些时间等待同一个处理器核心..不确定如何测试这个理论....
  3. NET CLR 线程没有以预期的优先级运行或者有一些隐藏的内部开销。

下面是应该表现出相同行为的代码的代表性摘录:

    var evaluator = new LookupBasedEvaluator();

    // find all ten-vertex polygons that are a subset of the set of points
    var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);

    const int TEST_SIZE = 10000000;  // evaluate the first 10 million records

    // materialize the data into memory...
    var polygons = ssg.AsParallel()
                      .Take(TEST_SIZE)
                      .Cast<PolygonData>()
                      .ToArray();

    var sw1 = Stopwatch.StartNew();
    // for loop completes in about 4.02 seconds... ~ 2.483 million/sec
    foreach( var polygon in polygons )
        evaluator.Evaluate(polygon);
    s1.Stop(); 
    Console.WriteLine( "Linear, single core loop: {0}", s1.ElapsedMilliseconds );

    // now attempt the same thing in parallel using Parallel.ForEach...
    // MS documentation indicates this internally uses a worker thread pool
    // completes in 2.61 seconds ... or ~ 3.831 million/sec
    var sw2 = Stopwatch.StartNew();
    Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
    sw2.Stop();
    Console.WriteLine( "Parallel.ForEach() loop: {0}", s2.ElapsedMilliseconds );

    // now using PLINQ, er get slightly better results, but not by much
    // completes in 2.21 seconds ... or ~ 4.524 million/second
    var sw3 = Stopwatch.StartNew();
    polygons.AsParallel(Environment.ProcessorCount)
            .AsUnordered() // no sure this is necessary...
            .ForAll( h => evalautor.Evaluate(h) );
    sw3.Stop();
    Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", s3.EllapsedMilliseconds );

    // now using four explicit threads:
    // best, still short of expectations at 1.99 seconds = ~ 5 million/sec
    ParameterizedThreadStart tsd = delegate(object pset) { foreach (var p in (IEnumerable<Card[]>) pset) evaluator.Evaluate(p); };
     var t1 = new Thread(tsd);
     var t2 = new Thread(tsd);
     var t3 = new Thread(tsd);
     var t4 = new Thread(tsd);

     var sw4 = Stopwatch.StartNew(); 
     t1.Start(hands);
     t2.Start(hands);
     t3.Start(hands);
     t4.Start(hands);
     t1.Join();
     t2.Join();
     t3.Join();
     t4.Join();
     sw.Stop();
     Console.WriteLine( "Four Explicit Threads: {0}", s4.EllapsedMilliseconds );

I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.

When run on a multi-core processor (IntelCore2 Quad Q6600 2.4GHz) it exhibits non-linear scaling as multiple threads are kicked off to process the data.

When run as a non-multithreaded loop on a single core, the process is able to complete approximately 2.4 million computations per second. When run as four threads you would expect four times as much throughput - somewhere in the neighborhood of 9 million computations per second - but alas, no. In practice it only completes about 4.1 millions per second ... quite a bit short from the expected throughput.

Furthermore, the behavior occurs no matter whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...

Nothing else is running on the machine using CPU time, nor are there any locks or other synchronization objects involved in the computation ... it should just tear ahead through the data. I've confirmed this (to the extent possible) by looking at perfmon data while the process runs ... and there are no reported thread contentions or garbage collection activity.

My theories at the moment:

  1. The overhead of all of the techniques (thread context switches, etc) is overwhelming the computations
  2. The threads are not getting assigned to each of the four cores and spend some time waiting on the same processor core .. not sure how to test this theory...
  3. .NET CLR threads are not running at the priority expected or have some hidden internal overhead.

Below is a representative excerpt from the code that should exhibit the same behavior:

    var evaluator = new LookupBasedEvaluator();

    // find all ten-vertex polygons that are a subset of the set of points
    var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);

    const int TEST_SIZE = 10000000;  // evaluate the first 10 million records

    // materialize the data into memory...
    var polygons = ssg.AsParallel()
                      .Take(TEST_SIZE)
                      .Cast<PolygonData>()
                      .ToArray();

    var sw1 = Stopwatch.StartNew();
    // for loop completes in about 4.02 seconds... ~ 2.483 million/sec
    foreach( var polygon in polygons )
        evaluator.Evaluate(polygon);
    s1.Stop(); 
    Console.WriteLine( "Linear, single core loop: {0}", s1.ElapsedMilliseconds );

    // now attempt the same thing in parallel using Parallel.ForEach...
    // MS documentation indicates this internally uses a worker thread pool
    // completes in 2.61 seconds ... or ~ 3.831 million/sec
    var sw2 = Stopwatch.StartNew();
    Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
    sw2.Stop();
    Console.WriteLine( "Parallel.ForEach() loop: {0}", s2.ElapsedMilliseconds );

    // now using PLINQ, er get slightly better results, but not by much
    // completes in 2.21 seconds ... or ~ 4.524 million/second
    var sw3 = Stopwatch.StartNew();
    polygons.AsParallel(Environment.ProcessorCount)
            .AsUnordered() // no sure this is necessary...
            .ForAll( h => evalautor.Evaluate(h) );
    sw3.Stop();
    Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", s3.EllapsedMilliseconds );

    // now using four explicit threads:
    // best, still short of expectations at 1.99 seconds = ~ 5 million/sec
    ParameterizedThreadStart tsd = delegate(object pset) { foreach (var p in (IEnumerable<Card[]>) pset) evaluator.Evaluate(p); };
     var t1 = new Thread(tsd);
     var t2 = new Thread(tsd);
     var t3 = new Thread(tsd);
     var t4 = new Thread(tsd);

     var sw4 = Stopwatch.StartNew(); 
     t1.Start(hands);
     t2.Start(hands);
     t3.Start(hands);
     t4.Start(hands);
     t1.Join();
     t2.Join();
     t3.Join();
     t4.Join();
     sw.Stop();
     Console.WriteLine( "Four Explicit Threads: {0}", s4.EllapsedMilliseconds );

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

久伴你 2024-08-12 09:42:36

所以我终于弄清楚了问题所在 - 我认为与 SO 社区分享它会很有用。

非线性性能的整个问题是 Evaluate() 方法中的一行代码造成的:

var coordMatrix = new long[100];

由于 Evaluate() 被调用了数百万次,因此此内存分配发生数百万次。事实上,CLR 在分配内存时会在内部执行一些线程间同步 - 否则多个线程上的分配可能会无意中重叠。将数组从方法本地实例更改为仅分配一次的类实例(但随后在方法本地循环中初始化)消除了可伸缩性问题。

通常,为仅在单个方法范围内使用(且有意义)的变量创建类级别成员是一种反模式。但在这种情况下,由于我需要最大可能的可扩展性,因此我将接受(并记录)这种优化。

尾声:在我进行此更改后,并发进程能够实现 1220 万次计算/秒。

PS 感谢 Igor Ostrovsky 与 MSDN 博客的密切链接,帮助我识别和诊断问题。

So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.

The entire issue with non-linear performance was the result of a single line inside the Evaluate() method:

var coordMatrix = new long[100];

Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR internally performs some inter-thread synchronization when allocating memory - otherwise allocation on multiple threads could inadvertently overlap. Changing the array from a method-local instance to a class instance that is only allocated once (but then initializing in a method-local loop) eliminated the scalability problem.

Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.

Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.

P.S. Kudos to Igor Ostrovsky for his germane link to MSDN blogs which helped me identify and diagnose the problem.

戒ㄋ 2024-08-12 09:42:36

看一下这篇文章: http://blogs.msdn .com/pfxteam/archive/2008/08/12/8849984.aspx

具体来说,限制并行区域中的内存分配,并仔细检查写入以确保它们不会发生在其他线程的内存位置附近读或写。

Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx

Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write.

能否归途做我良人 2024-08-12 09:42:36

与顺序算法相比,并行算法预计会出现非线性缩放,因为并行化存在一些固有的开销。 (当然,理想情况下,您希望尽可能接近。)

此外,通常在并行算法中您需要注意某些事情,而在顺序算法中则不需要。除了同步(这确实会使您的工作陷入困境)之外,还可能发生其他一些事情:

  • CPU 和操作系统无法将所有时间都投入到您的应用程序上。因此,它需要时不时地进行上下文切换,以让其他进程完成一些工作。当您仅使用单个核心时,您的进程被切换的可能性较小,因为它还有其他三个核心可供选择。请注意,即使您可能认为没有其他东西在运行,操作系统或某些服务仍然可能正在执行一些后台工作。
  • 如果每个线程都访问大量数据,并且这些数据在线程之间并不常见,那么您很可能无法将所有这些数据存储在 CPU 缓存中。这意味着需要更多的内存访问,这(相对)慢。

据我所知,您当前的显式方法在线程之间使用共享迭代器。如果整个数组的处理变化很大,那么这是一个不错的解决方案,但可能会产生同步开销,以防止跳过某个元素(检索当前元素并将内部指针移动到下一个元素需要是原子操作,以防止跳过跳过一个元素)。

因此,假设每个元素的处理时间预计大致相等,无论元素的位置如何,对数组进行分区可能是一个更好的主意。假定您有 1000 万条记录,这意味着告诉线程 1 处理元素 0 到 2,499,999,线程 2 处理元素 2,500,000 到 4,999,999,等等。您可以为每个线程分配一个 ID 并使用它来计算实际范围。

另一个小改进是让主线程充当计算线程之一。不过,如果我没记错的话,那只是一件非常的小事。

Non-linear scaling is to be expected with a parallel algorithm in comparison with a sequential algorithm, since there is some inherent overhead in the parallelization. ( Ideally, of course, you want to get as close as you can.)

Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:

  • The CPU and OS can't devote all of its time to your application. Thus, it needs to do context switching every now and again to let other processes get some work done. When you're only using a single core, it is less likely that your process is switched out, because it has three other cores to choose from. Note that even though you might think nothing else is running, the OS or some services could still be performing some background work.
  • If each of your threads is accessing a lot of data, and this data is not common between threads, you will most likely not be able to store all of this in the CPU cache. That means a lot more memory accessing is required, which is (relatively) slow.

As far as I can tell, your current explicit approach uses a shared iterator between the threads. That's an okay solution if the processing vary wildly throughout the array, but there is likely to be synchronization overhead to prevent an element from being skipped (retrieving the current element and moving the internal pointer to the next element needs to be an atomic operation to prevent skipping an element).

Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of the position of the element. Given that you have 10 million records, that means telling thread 1 to work on elements 0 to 2,499,999, thread 2 works on elements 2,500,000 to 4,999,999, etc. You can assign each thread an ID and use this to calculate the actual range.

Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.

心凉 2024-08-12 09:42:36

我当然不会期望存在线性关系,但我认为您会看到比这更大的收益。我假设所有核心的 CPU 使用率都达到最大。我脑子里只有几个想法。

  • 您是否使用任何需要同步的共享数据结构(显式或隐式)?
  • 您是否尝试过分析或记录性能计数器来确定瓶颈所在?你能提供更多线索吗?

编辑:抱歉,我刚刚注意到您已经解决了我的两个观点。

I certainly would not expect a linear relationship, but I would have thought you would have seen a bigger gain than that. I am assuming that the CPU usage is maxed out on all cores. Just a couple of thoughts off the top of my head.

  • Are you using any shared data structures (either explicitly or implicitly) that require synchronization?
  • Have you tried profiling or recording performance counters to determine where the bottleneck is? Can you give any more clues.

Edit: Sorry, I just noticed that you have already addressed both of my points.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文