Why doesn't my threaded .Net application scale linearly when allocating large amounts of memory?
I’ve run into something strange about the effect of large memory allocations on the scalability of the .Net runtime. In my test application I create lots of strings in a tight loop for a fixed number of cycles and spit out a rate of loop iterations per second. The weirdness comes in when I run this loop in several threads – it appears that the rate does not increase linearly. The problem gets even worse when you create large strings.
Let me show you the results. My machine is an 8 GB, 8-core box running Windows Server 2008 R1, 32-bit. It has two 4-core Intel Xeon 1.83 GHz (E5320) processors. The "work" performed is a set of alternating calls to ToUpper()
and ToLower()
on a string. I run the test for one thread, two threads, etc – up to the maximum. The columns in the table below are:
- Rate: The number of loops across all threads divided by the duration.
- Linear Rate: The ideal rate if performance were to scale linearly. It is calculated as the rate achieved by one thread multiplied by the number of threads for that test.
- Variance: Calculated as the percentage by which the rate falls short of the linear rate.
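As a worked example of the Variance column, here is the calculation for the 8-thread row of Example 1 below (the table shows 2580.65 rather than 2580.64 because it multiplies the unrounded one-thread rate):

```csharp
using System;

// Variance = percentage by which the measured rate falls short of the
// linear (ideal) rate. Numbers taken from the 8-thread row of Example 1.
double singleThreadRate = 322.58;
int threads = 8;
double actualRate = 2051.28;

double linearRate = singleThreadRate * threads;        // 2580.64
double variance = (1 - actualRate / linearRate) * 100; // about 20.51

Console.WriteLine("{0:0.00} %", variance);
```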
Example 1: 10,000 loops, 8 threads, 1024 chars per string
The first example starts off with one thread, then two threads and eventually runs the test with eight threads. Each thread creates 10,000 strings of 1024 chars each:
Creating 10000 strings per thread, 1024 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
322.58        322.58        0.00 %        1
689.66        645.16        -6.90 %       2
882.35        967.74        8.82 %        3
1081.08       1290.32       16.22 %       4
1388.89       1612.90       13.89 %       5
1666.67       1935.48       13.89 %       6
2000.00       2258.07       11.43 %       7
2051.28       2580.65       20.51 %       8
Done.
Example 2: 10,000 loops, 8 threads, 32,000 chars per string
In the second example I’ve increased the number of chars for each string to 32,000.
Creating 10000 strings per thread, 32000 chars each, using up to 8 threads
GCMode = Server

Rate          Linear Rate   % Variance    Threads
--------------------------------------------------------
14.10         14.10         0.00 %        1
24.36         28.21         13.64 %       2
33.15         42.31         21.66 %       3
40.98         56.42         27.36 %       4
48.08         70.52         31.83 %       5
61.35         84.63         27.51 %       6
72.61         98.73         26.45 %       7
67.85         112.84        39.86 %       8
Done.
Notice the difference in variance from the linear rate; in the second table the actual rate is 39% less than the linear rate.
My question is: Why does this app not scale linearly?
My Observations
False Sharing
I initially thought that this could be due to False Sharing but, as you’ll see in the source code, I’m not sharing any collections and the strings are quite big. The only overlap that could exist is at the beginning of one string and the end of another.
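For comparison, this is what false sharing actually looks like in a minimal, hypothetical repro (not part of the benchmark): two threads hammering adjacent array slots that share a 64-byte cache line tend to run measurably slower than the same threads writing to slots on different lines.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class FalseSharingDemo
{
    const long Iterations = 10000000;

    // Times two threads incrementing slots[indexA] and slots[indexB].
    public static long TimeIncrements(long[] slots, int indexA, int indexB)
    {
        var sw = Stopwatch.StartNew();
        var t1 = new Thread(() => { for (long i = 0; i < Iterations; i++) slots[indexA]++; });
        var t2 = new Thread(() => { for (long i = 0; i < Iterations; i++) slots[indexB]++; });
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var slots = new long[64];
        // Slots 0 and 1 are 8 bytes apart: same cache line, so the cores
        // invalidate each other's line on every write (false sharing).
        Console.WriteLine("adjacent: {0} ms", TimeIncrements(slots, 0, 1));
        // Slots 0 and 8 are 64 bytes apart: separate cache lines.
        Console.WriteLine("spread:   {0} ms", TimeIncrements(slots, 0, 8));
    }
}
```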
Server-mode Garbage Collector
I’m using <gcServer enabled="true"/> so that each core gets its own heap and garbage collector thread.
Large Object Heap
I don't think the objects I allocate are being sent to the Large Object Heap, because they are under 85,000 bytes in size.
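Quick arithmetic supports this (the ~20-byte per-object overhead below is an approximation; the exact figure varies by CLR version and bitness): each char is 2 bytes of UTF-16, so even the 32,000-char strings come to roughly 64 KB, under the 85,000-byte LOH threshold.

```csharp
using System;

// Rough size of a CLR string on a 32-bit runtime: object header,
// method-table pointer, and length fields (~20 bytes, an approximation)
// plus 2 bytes per character.
static int EstimateStringBytes(int chars)
{
    const int overhead = 20; // approximate; varies by CLR version/bitness
    return overhead + 2 * chars;
}

Console.WriteLine(EstimateStringBytes(1024));  // ~2,068 bytes
Console.WriteLine(EstimateStringBytes(32000)); // ~64,020 bytes: still below 85,000
```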
String Interning
I thought that string values might be shared under the hood due to interning (see MSDN), so I tried compiling with interning disabled. This produced worse results than those shown above.
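An aside worth noting (not part of the original test): only string literals are interned automatically; strings built at runtime, like the PadRight result in the benchmark, are not, so interning is unlikely to cause sharing here in the first place. A minimal check, with the NoStringInterning relaxation attribute (presumably what "compiling with interning disabled" refers to) shown in a comment:

```csharp
using System;
// Interning of literals can be disabled per-assembly with:
// [assembly: System.Runtime.CompilerServices.CompilationRelaxations(
//     System.Runtime.CompilerServices.CompilationRelaxations.NoStringInterning)]

class InterningCheck
{
    static void Main()
    {
        // Runtime-built string; assuming no equal literal exists anywhere
        // in the loaded assemblies, it will not be in the intern pool.
        string built = "".PadRight(37, 'q');

        Console.WriteLine(String.IsInterned("lit" + "eral") != null); // True: compiler-folded literal
        Console.WriteLine(String.IsInterned(built) == null);          // True: not interned
    }
}
```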
Other data types
I tried the same example using small and large integer arrays, in which I loop through each element and change the value. It produces similar results, following the trend of performing worse with larger allocations.
Source Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Diagnostics;
using System.Runtime;
using System.Runtime.CompilerServices;
namespace StackOverflowExample
{
public class Program
{
private static int columnWidth = 14;
static void Main(string[] args)
{
int loopCount, maxThreads, stringLength;
loopCount = maxThreads = stringLength = 0;
try
{
loopCount = args.Length > 0 ? Int32.Parse(args[0]) : 1000;
maxThreads = args.Length > 1 ? Int32.Parse(args[1]) : 4;
stringLength = args.Length > 2 ? Int32.Parse(args[2]) : 1024;
}
catch
{
Console.WriteLine("Usage: StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]");
System.Environment.Exit(2);
}
float rate;
float linearRate = 0;
Stopwatch stopwatch;
Console.WriteLine("Creating {0} strings per thread, {1} chars each, using up to {2} threads", loopCount, stringLength, maxThreads);
Console.WriteLine("GCMode = {0}", GCSettings.IsServerGC ? "Server" : "Workstation");
Console.WriteLine();
PrintRow("Rate", "Linear Rate", "% Variance", "Threads");
PrintRow(4, "".PadRight(columnWidth, '-'));
for (int runCount = 1; runCount <= maxThreads; runCount++)
{
// Create the workers
Worker[] workers = new Worker[runCount];
workers.Length.Range().ForEach(index => workers[index] = new Worker());
// Start timing and kick off the threads
stopwatch = Stopwatch.StartNew();
workers.ForEach(w => new Thread(
new ThreadStart(
() => w.DoWork(loopCount, stringLength)
)
).Start());
// Wait until all threads are complete
WaitHandle.WaitAll(
workers.Select(p => p.Complete).ToArray());
stopwatch.Stop();
// Print the results
rate = (float)loopCount * runCount / stopwatch.ElapsedMilliseconds; // loops per millisecond
if (runCount == 1) { linearRate = rate; }
PrintRow(String.Format("{0:#0.00}", rate),
String.Format("{0:#0.00}", linearRate * runCount),
String.Format("{0:#0.00} %", (1 - rate / (linearRate * runCount)) * 100),
runCount.ToString());
}
Console.WriteLine("Done.");
}
private static void PrintRow(params string[] columns)
{
columns.ForEach(c => Console.Write(c.PadRight(columnWidth)));
Console.WriteLine();
}
private static void PrintRow(int repeatCount, string column)
{
for (int counter = 0; counter < repeatCount; counter++)
{
Console.Write(column.PadRight(columnWidth));
}
Console.WriteLine();
}
}
public class Worker
{
public ManualResetEvent Complete { get; private set; }
public Worker()
{
Complete = new ManualResetEvent(false);
}
public void DoWork(int loopCount, int stringLength)
{
// Build the string
string theString = "".PadRight(stringLength, 'a');
for (int counter = 0; counter < loopCount; counter++)
{
// Return values are intentionally discarded; each call still allocates a new string.
if (counter % 2 == 0) { theString.ToUpper(); }
else { theString.ToLower(); }
}
Complete.Set();
}
}
public static class HandyExtensions
{
public static IEnumerable<int> Range(this int max)
{
for (int counter = 0; counter < max; counter++)
{
yield return counter;
}
}
public static void ForEach<T>(this IEnumerable<T> items, Action<T> action)
{
foreach(T item in items)
{
action(item);
}
}
}
}
App.Config
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<runtime>
<gcServer enabled="true"/>
</runtime>
</configuration>
Running the Example
To run StackOverflowExample.exe on your box, call it with these command-line parameters:
StackOverFlowExample.exe [loopCount] [maxThreads] [stringLength]
- loopCount: The number of times each thread will manipulate the string.
- maxThreads: The number of threads to progress to.
- stringLength: The number of characters to fill the string with.
Comments (5)
You may want to look that this question of mine.
I ran into a similar problem that was due to the fact that the CLR performs inter-thread synchronization when allocating memory to avoid overlapping allocations. Now, with the server GC, the locking algorithm may be different - but something along those same lines may be affecting your code.
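One way to test that hypothesis (a sketch, not part of the original benchmark) is an allocation-free variant of DoWork that transforms a preallocated buffer in place; if this version scales close to linearly, allocation was the contended resource.

```csharp
using System;

public static class AllocationFreeWorker
{
    // Same alternating upper/lower work as Worker.DoWork, but mutating a
    // single preallocated char[] instead of allocating a new string per call.
    public static char[] DoWork(int loopCount, int stringLength)
    {
        char[] buffer = new char[stringLength]; // one allocation, up front
        for (int i = 0; i < stringLength; i++) buffer[i] = 'a';

        for (int counter = 0; counter < loopCount; counter++)
        {
            bool upper = counter % 2 == 0;
            for (int i = 0; i < stringLength; i++)
                buffer[i] = upper ? char.ToUpper(buffer[i]) : char.ToLower(buffer[i]);
        }
        return buffer;
    }
}
```

If the rates for this version still fall off at 8 threads, the bottleneck is more likely memory bandwidth than allocator locking.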
The hardware you're running this on is not capable of linear scaling of multiple processes or threads:
- You have a single memory bank; that's a bottleneck. (Multi-channel memory may improve access, but not for more processes than you have memory banks; the E5320 processor appears to support 1-4 memory channels.)
- There is only one memory controller per physical CPU package (two in your case); that's a bottleneck.
- There are two L2 caches per CPU package; that's a bottleneck, and cache coherency issues will appear if those caches are exhausted.
This doesn't even get into the OS/RTL/VM issues in managing process scheduling and memory management, which will also contribute to non-linear scaling.
I think you're getting pretty reasonable results: a significant speedup with multiple threads, at each increment up to 8.
Truly, have you ever read anything to suggest that commodity multi-CPU hardware is capable of linear scaling of multiple processes/threads? I haven't.
Your initial post is fundamentally flawed: you're assuming that a linear speedup is possible through parallel execution. It isn't, and never has been. See Amdahl's Law (yes, I know, Wikipedia, but it's easier than anything else).
Your code, viewed from the abstraction the CLR provides, appears to have no dependencies. However, as LBushkin pointed out, that isn't the case. As SuperMagic pointed out, the hardware itself implies dependencies between the threads of execution. This is true of just about any problem that can be parallelized: even with independent machines and independent hardware, some portion of the problem usually requires some element of synchronization, and that synchronization prevents a linear speedup.
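To put numbers on that: under Amdahl's Law, a workload with serial fraction s achieves at most 1 / (s + (1 - s) / n) speedup on n cores. The 8-thread result in Example 2 (67.85 / 14.10, about a 4.81x speedup) is consistent with a serial fraction of only about 9-10%:

```csharp
using System;

// Amdahl's Law: maximum speedup on n cores when a fraction
// serialFraction of the work cannot be parallelized.
static double AmdahlSpeedup(double serialFraction, int cores)
{
    return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
}

Console.WriteLine(AmdahlSpeedup(0.095, 8));  // ~4.80, close to the measured 4.81x
Console.WriteLine(AmdahlSpeedup(0.095, 64)); // ~9.2: more cores barely help
```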
The effect of a memory allocator on application speedup is more closely related to the number of allocations than to the amount allocated. It's also influenced by the allocation latency (the time to complete a single allocation on a single thread), which in the case of the CLR is extremely fast due to the use of a bump-pointer allocator (see section 3.4.3).
Your question is asking why the actual speedup is sublinear, and to answer that you should certainly review Amdahl's Law.
Going back to the Notes on the CLR Garbage Collector, you can see that an allocation context belongs to a particular thread (section 3.4.1), which reduces (but does not eliminate) the amount of synchronization required during multi-threaded allocations. If you find that allocation is truly the weak point, I would suggest trying an object pool (possibly per-thread) to reduce the load on the collector. By reducing the sheer number of allocations, you'll reduce the number of times the collector has to run. However, this will also result in more objects making it to generation 2, which is the slowest to collect when it is needed.
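A minimal sketch of that per-thread pooling idea (a hypothetical helper, using [ThreadStatic] so it works on .NET 2; on .NET 4+ ThreadLocal<T> would be the idiomatic choice):

```csharp
using System;
using System.Text;

public static class PerThreadBuffer
{
    // One cached StringBuilder per thread; reusing it removes the
    // per-iteration allocation that feeds the garbage collector.
    [ThreadStatic]
    private static StringBuilder cached;

    public static StringBuilder Get(int capacity)
    {
        if (cached == null || cached.Capacity < capacity)
            cached = new StringBuilder(capacity);
        cached.Length = 0; // reset contents without reallocating
        return cached;
    }
}
```

The worker would then rewrite characters inside the returned StringBuilder instead of calling ToUpper/ToLower, at the cost of less idiomatic string handling.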
Finally, Microsoft continues to improve the garbage collector in newer versions of the CLR, so you should target the most recent version you are able to (.NET 2 at a bare minimum).
Great question Luke! I'm very interested in the answer.
I suspect that you were not expecting linear scaling, but something better than a 39% variance.
NoBugz - Based on 280Z28's links, there would actually be a GC heap per core with GCMode=Server. There should also be a GC thread per heap. That shouldn't result in the concurrency issues you mention, should it?
LBushkin - I think that that is the key question, does GCMode=Server still cause inter-thread locking when allocating memory? Anyone know - or can it simply be explained by hardware limitations as mentioned by SuperMagic?