Why doesn't my threaded .Net application scale linearly when allocating large amounts of memory?
I’ve run into something strange about the effect of large memory allocations on the scalability of the .Net runtime. In my test application I create lots of strings in a tight loop for a fixed number of cycles and spit out a rate of loop iterations per second. The weirdness comes in when I run this loop in several threads – it appears that the rate does not increase linearly. The problem gets even worse when you create large strings.
Let me show you the results. My machine is an 8 GB, 8-core box running Windows Server 2008 R1, 32-bit. It has two 4-core Intel Xeon 1.83 GHz (E5320) processors. The "work" performed is a set of alternating calls to ToUpper()
and ToLower()
on a string. I run the test for one thread, two threads, etc – up to the maximum. The columns in the table below are:
- Rate: The number of loops across all threads divided by the duration.
- Linear Rate: The ideal rate if performance were to scale linearly. It is calculated as the rate achieved by one thread multiplied by the number of threads for that test.
- Variance: Calculated as the percentage by which the rate falls short of the linear rate.
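As a worked example of the Variance column, here is the calculation for the 8-thread row of Example 1 below (the table shows 2580.65 rather than 2580.64 because it multiplies the unrounded one-thread rate):

```csharp
using System;

// Variance = percentage by which the measured rate falls short of the
// linear (ideal) rate. Numbers taken from the 8-thread row of Example 1.
double singleThreadRate = 322.58;
int threads = 8;
double actualRate = 2051.28;

double linearRate = singleThreadRate * threads;        // 2580.64
double variance = (1 - actualRate / linearRate) * 100; // about 20.51

Console.WriteLine("{0:0.00} %", variance);
```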
Example 1: 10,000 loops, 8 threads, 1024 chars per string
The first example starts off with one thread, then two threads and eventually runs the test with eight threads. Each thread creates 10,000 strings of 1024 chars each:
Creating 10000 strings per thread, 1024 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
322.58        322.58        0.00 %        1
689.66        645.16        -6.90 %       2
882.35        967.74        8.82 %        3
1081.08       1290.32       16.22 %       4
1388.89       1612.90       13.89 %       5
1666.67       1935.48       13.89 %       6
2000.00       2258.07       11.43 %       7
2051.28       2580.65       20.51 %       8
Done.
Example 2: 10,000 loops, 8 threads, 32,000 chars per string
In the second example I’ve increased the number of chars for each string to 32,000.
Creating 10000 strings per thread, 32000 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
14.10         14.10         0.00 %        1
24.36         28.21         13.64 %       2
33.15         42.31         21.66 %       3
40.98         56.42         27.36 %       4
48.08         70.52         31.83 %       5
61.35         84.63         27.51 %       6
72.61         98.73         26.45 %       7
67.85         112.84        39.86 %       8
Done.
Notice the difference in variance from the linear rate; in the second table the actual rate is 39% less than the linear rate.
My question is: Why does this app not scale linearly?
My Observations
False Sharing
I initially thought that this could be due to False Sharing but, as you’ll see in the source code, I’m not sharing any collections and the strings are quite big. The only overlap that could exist is at the beginning of one string and the end of another.
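For comparison, this is what false sharing actually looks like in a minimal, hypothetical repro (not part of the benchmark): two threads hammering adjacent array slots that share a 64-byte cache line tend to run measurably slower than the same threads writing to slots on different lines.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class FalseSharingDemo
{
    const long Iterations = 10000000;

    // Times two threads incrementing slots[indexA] and slots[indexB].
    public static long TimeIncrements(long[] slots, int indexA, int indexB)
    {
        var sw = Stopwatch.StartNew();
        var t1 = new Thread(() => { for (long i = 0; i < Iterations; i++) slots[indexA]++; });
        var t2 = new Thread(() => { for (long i = 0; i < Iterations; i++) slots[indexB]++; });
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var slots = new long[64];
        // Slots 0 and 1 are 8 bytes apart: same cache line, so the cores
        // invalidate each other's line on every write (false sharing).
        Console.WriteLine("adjacent: {0} ms", TimeIncrements(slots, 0, 1));
        // Slots 0 and 8 are 64 bytes apart: separate cache lines.
        Console.WriteLine("spread:   {0} ms", TimeIncrements(slots, 0, 8));
    }
}
```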
Server-mode Garbage Collector
I’m using <gcServer enabled="true"/> so that each core gets its own heap and garbage collector thread.
Large Object Heap
I don't think the objects I allocate are being sent to the Large Object Heap, because they are under 85,000 bytes in size.
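Quick arithmetic supports this (the ~20-byte per-object overhead below is an approximation; the exact figure varies by CLR version and bitness): each char is 2 bytes of UTF-16, so even the 32,000-char strings come to roughly 64 KB, under the 85,000-byte LOH threshold.

```csharp
using System;

// Rough size of a CLR string on a 32-bit runtime: object header,
// method-table pointer, and length fields (~20 bytes, an approximation)
// plus 2 bytes per character.
static int EstimateStringBytes(int chars)
{
    const int overhead = 20; // approximate; varies by CLR version/bitness
    return overhead + 2 * chars;
}

Console.WriteLine(EstimateStringBytes(1024));  // ~2,068 bytes
Console.WriteLine(EstimateStringBytes(32000)); // ~64,020 bytes: still below 85,000
```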
String Interning
I thought that string values might be shared under the hood due to interning (see MSDN), so I tried compiling with interning disabled. This produced worse results than those shown above.
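An aside worth noting (not part of the original test): only string literals are interned automatically; strings built at runtime, like the PadRight result in the benchmark, are not, so interning is unlikely to cause sharing here in the first place. A minimal check, with the NoStringInterning relaxation attribute (presumably what "compiling with interning disabled" refers to) shown in a comment:

```csharp
using System;
// Interning of literals can be disabled per-assembly with:
// [assembly: System.Runtime.CompilerServices.CompilationRelaxations(
//     System.Runtime.CompilerServices.CompilationRelaxations.NoStringInterning)]

class InterningCheck
{
    static void Main()
    {
        // Runtime-built string; assuming no equal literal exists anywhere
        // in the loaded assemblies, it will not be in the intern pool.
        string built = "".PadRight(37, 'q');

        Console.WriteLine(String.IsInterned("lit" + "eral") != null); // True: compiler-folded literal
        Console.WriteLine(String.IsInterned(built) == null);          // True: not interned
    }
}
```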
Other data types
I tried the same example using small and large integer arrays, in which I loop through each element and change the value. It produces similar results, following the trend of performing worse with larger allocations.
Source Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Diagnostics;
using System.Runtime;
using System.Runtime.CompilerServices;
namespace StackOverflowExample
{
public class Program
{
private static int columnWidth = 14;
static void Main(string[] args)
{
int loopCount, maxThreads, stringLength;
loopCount = maxThreads = stringLength = 0;
try
{
loopCount = args.Length > 0 ? Int32.Parse(args[0]) : 1000;
maxThreads = args.Length > 1 ? Int32.Parse(args[1]) : 4;
stringLength = args.Length > 2 ? Int32.Parse(args[2]) : 1024;
}
catch
{
Console.WriteLine("Usage: StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]");
System.Environment.Exit(2);
}
float rate;
float linearRate = 0;
Stopwatch stopwatch;
Console.WriteLine("Creating {0} strings per thread, {1} chars each, using up to {2} threads", loopCount, stringLength, maxThreads);
Console.WriteLine("GCMode = {0}", GCSettings.IsServerGC ? "Server" : "Workstation");
Console.WriteLine();
PrintRow("Rate", "Linear Rate", "% Variance", "Threads");
PrintRow(4, "".PadRight(columnWidth, '-'));
for (int runCount = 1; runCount <= maxThreads; runCount++)
{
// Create the workers
Worker[] workers = new Worker[runCount];
workers.Length.Range().ForEach(index => workers[index] = new Worker());
// Start timing and kick off the threads
stopwatch = Stopwatch.StartNew();
workers.ForEach(w => new Thread(
new ThreadStart(
() => w.DoWork(loopCount, stringLength)
)
).Start());
// Wait until all threads are complete
WaitHandle.WaitAll(
workers.Select(p => p.Complete).ToArray());
stopwatch.Stop();
// Print the results
rate = (float)loopCount * runCount / stopwatch.ElapsedMilliseconds; // loops per millisecond
if (runCount == 1) { linearRate = rate; }
PrintRow(String.Format("{0:#0.00}", rate),
String.Format("{0:#0.00}", linearRate * runCount),
String.Format("{0:#0.00} %", (1 - rate / (linearRate * runCount)) * 100),
runCount.ToString());
}
Console.WriteLine("Done.");
}
private static void PrintRow(params string[] columns)
{
columns.ForEach(c => Console.Write(c.PadRight(columnWidth)));
Console.WriteLine();
}
private static void PrintRow(int repeatCount, string column)
{
for (int counter = 0; counter < repeatCount; counter++)
{
Console.Write(column.PadRight(columnWidth));
}
Console.WriteLine();
}
}
public class Worker
{
public ManualResetEvent Complete { get; private set; }
public Worker()
{
Complete = new ManualResetEvent(false);
}
public void DoWork(int loopCount, int stringLength)
{
// Build the string
string theString = "".PadRight(stringLength, 'a');
for (int counter = 0; counter < loopCount; counter++)
{
// Return values are intentionally discarded; each call still allocates a new string.
if (counter % 2 == 0) { theString.ToUpper(); }
else { theString.ToLower(); }
}
Complete.Set();
}
}
public static class HandyExtensions
{
public static IEnumerable<int> Range(this int max)
{
for (int counter = 0; counter < max; counter++)
{
yield return counter;
}
}
public static void ForEach<T>(this IEnumerable<T> items, Action<T> action)
{
foreach(T item in items)
{
action(item);
}
}
}
}
App.Config
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<runtime>
<gcServer enabled="true"/>
</runtime>
</configuration>
Running the Example
To run StackOverflowExample.exe on your box, call it with these command-line parameters:
StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]
- loopCount: The number of times each thread will manipulate the string.
- maxThreads: The number of threads to progress to.
- stringLength: The number of characters to fill the string with.
Comments (5)
You may want to look that this question of mine.
I ran into a similar problem that was due to the fact that the CLR performs inter-thread synchronization when allocating memory to avoid overlapping allocations. Now, with the server GC, the locking algorithm may be different - but something along those same lines may be affecting your code.
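One way to test that hypothesis (a sketch, not part of the original benchmark) is an allocation-free variant of DoWork that transforms a preallocated buffer in place; if this version scales close to linearly, allocation was the contended resource.

```csharp
using System;

public static class AllocationFreeWorker
{
    // Same alternating upper/lower work as Worker.DoWork, but mutating a
    // single preallocated char[] instead of allocating a new string per call.
    public static char[] DoWork(int loopCount, int stringLength)
    {
        char[] buffer = new char[stringLength]; // one allocation, up front
        for (int i = 0; i < stringLength; i++) buffer[i] = 'a';

        for (int counter = 0; counter < loopCount; counter++)
        {
            bool upper = counter % 2 == 0;
            for (int i = 0; i < stringLength; i++)
                buffer[i] = upper ? char.ToUpper(buffer[i]) : char.ToLower(buffer[i]);
        }
        return buffer;
    }
}
```

If the rates for this version still fall off at 8 threads, the bottleneck is more likely memory bandwidth than allocator locking.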
The hardware you're running this on is not capable of linear scaling of multiple processes or threads:
- You have a single memory bank; that's a bottleneck. (Multi-channel memory may improve access, but not for more processes than you have memory banks; the E5320 processor appears to support 1-4 memory channels.)
- There is only one memory controller per physical CPU package (two in your case); that's a bottleneck.
- There are two L2 caches per CPU package; that's a bottleneck, and cache coherency issues will appear if those caches are exhausted.
This doesn't even get into the OS/RTL/VM issues in managing process scheduling and memory management, which will also contribute to non-linear scaling.
I think you're getting pretty reasonable results: a significant speedup with multiple threads, at each increment up to 8.
Truly, have you ever read anything to suggest that commodity multi-CPU hardware is capable of linear scaling of multiple processes/threads? I haven't.
Your initial post is fundamentally flawed: you're assuming that a linear speedup is possible through parallel execution. It isn't, and never has been. See Amdahl's Law (yes, I know, Wikipedia, but it's easier than anything else).
Your code, viewed from the abstraction the CLR provides, appears to have no dependencies. However, as LBushkin pointed out, that isn't the case. As SuperMagic pointed out, the hardware itself implies dependencies between the threads of execution. This is true of just about any problem that can be parallelized: even with independent machines and independent hardware, some portion of the problem usually requires some element of synchronization, and that synchronization prevents a linear speedup.
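To put numbers on that: under Amdahl's Law, a workload with serial fraction s achieves at most 1 / (s + (1 - s) / n) speedup on n cores. The 8-thread result in Example 2 (67.85 / 14.10, about a 4.81x speedup) is consistent with a serial fraction of only about 9-10%:

```csharp
using System;

// Amdahl's Law: maximum speedup on n cores when a fraction
// serialFraction of the work cannot be parallelized.
static double AmdahlSpeedup(double serialFraction, int cores)
{
    return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
}

Console.WriteLine(AmdahlSpeedup(0.095, 8));  // ~4.80, close to the measured 4.81x
Console.WriteLine(AmdahlSpeedup(0.095, 64)); // ~9.2: more cores barely help
```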
The effect of a memory allocator on application speedup is more closely related to the number of allocations than to the amount allocated. It's also influenced by the allocation latency (the time to complete a single allocation on a single thread), which in the case of the CLR is extremely fast due to the use of a bump-pointer allocator (see section 3.4.3).
Your question is asking why the actual speedup is sublinear, and to answer that you should certainly review Amdahl's Law.
Going back to the Notes on the CLR Garbage Collector, you can see that an allocation context belongs to a particular thread (section 3.4.1), which reduces (but does not eliminate) the amount of synchronization required during multi-threaded allocations. If you find that allocation is truly the weak point, I would suggest trying an object pool (possibly per-thread) to reduce the load on the collector. By reducing the sheer number of allocations, you'll reduce the number of times the collector has to run. However, this will also result in more objects making it to generation 2, which is the slowest to collect when it is needed.
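A minimal sketch of that per-thread pooling idea (a hypothetical helper, using [ThreadStatic] so it works on .NET 2; on .NET 4+ ThreadLocal<T> would be the idiomatic choice):

```csharp
using System;
using System.Text;

public static class PerThreadBuffer
{
    // One cached StringBuilder per thread; reusing it removes the
    // per-iteration allocation that feeds the garbage collector.
    [ThreadStatic]
    private static StringBuilder cached;

    public static StringBuilder Get(int capacity)
    {
        if (cached == null || cached.Capacity < capacity)
            cached = new StringBuilder(capacity);
        cached.Length = 0; // reset contents without reallocating
        return cached;
    }
}
```

The worker would then rewrite characters inside the returned StringBuilder instead of calling ToUpper/ToLower, at the cost of less idiomatic string handling.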
Finally, Microsoft continues to improve the garbage collector in newer versions of the CLR, so you should target the most recent version you are able to (.NET 2 at a bare minimum).
Great question Luke! I'm very interested in the answer.
I suspect that you were not expecting linear scaling, but something better than a 39% variance.
NoBugz - Based on 280Z28's links, there would actually be a GC heap per core with GCMode=Server. There should also be a GC thread per heap. That shouldn't result in the concurrency issues you mention, should it?
LBushkin - I think that that is the key question, does GCMode=Server still cause inter-thread locking when allocating memory? Anyone know - or can it simply be explained by hardware limitations as mentioned by SuperMagic?