Scalability of the .NET 4 garbage collector

Posted on 2024-09-14


I recently benchmarked the .NET 4 garbage collector, allocating intensively from several threads. When the allocated values were recorded in an array, I observed no scalability just as I had expected (because the system contends for synchronized access to a shared old generation). However, when the allocated values were immediately discarded, I was horrified to observe no scalability then either!

I had expected the temporary case to scale almost linearly because each thread should simply wipe the nursery gen0 clean and start again without contending for any shared resources (nothing surviving to older generations and no L2 cache misses because gen0 easily fits in L1 cache).

For example, this MSDN article says:

Synchronization-free Allocations On a multiprocessor system, generation 0 of the managed heap is split into multiple memory arenas using one arena per thread. This allows multiple threads to make allocations simultaneously so that exclusive access to the heap is not required.

Can anyone verify my findings and/or explain this discrepancy between my predictions and observations?
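The question's actual C# harness isn't included. As an illustration only, here is a minimal analogue of the "immediately discarded" case, written in Java so it is runnable as-is; the class, parameters, and iteration counts are my own, and the JVM's collector will of course behave differently from the .NET 4 one, but the shape of the experiment is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class AllocScaling {
    // Allocate `count` short-lived objects on each of `threads` threads
    // and return the wall-clock time in nanoseconds.
    static long run(int threads, int count) throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                byte[] keep = null;
                for (int j = 0; j < count; j++) {
                    byte[] tmp = new byte[128];        // freshly allocated...
                    if ((j & 0xFFFF) == 0) keep = tmp; // ...and almost always discarded
                }
                if (keep == null) throw new AssertionError();
            });
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) t.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 1_000_000;
        long t1 = run(1, n);
        long t4 = run(4, n);
        // Perfect scaling would keep the 4-thread time close to the 1-thread
        // time even though 4x as much total work was done; the question
        // reports that this is not what was observed on .NET 4.
        System.out.printf("1 thread: %d ms, 4 threads: %d ms%n",
                t1 / 1_000_000, t4 / 1_000_000);
    }
}
```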


Comments (5)

嘿哥们儿 2024-09-21 14:49:56


Not a complete answer to the question, but just to clear up some misconceptions: the .NET GC is only concurrent in workstation mode. In server mode, it uses stop-the-world parallel GC. More details here. The separate nurseries in .NET are primarily to avoid synchronisation on allocation; they are nevertheless part of the global heap and cannot be collected separately.

梦罢 2024-09-21 14:49:56


Not so sure what this is about and exactly what you saw on your machine. There are however two distinct versions of the CLR on your machine. Mscorwks.dll and mscorsvc.dll. The former is the one you get when you run your program on a work station, the latter on one of the server versions of Windows (like Windows 2003 or 2008).

The work station version is kind to your local PC, it doesn't gobble all machine resources. You can still read your email while a GC is going on. The server version is optimized to scale on server level hardware. Lots of RAM (GC doesn't kick in that quick) and lots of CPU cores (garbage gets collected on more than one core). Your quoted article probably talks about the server version.

You can select the server version on your workstation, use the <gcServer> element in your .config file.
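Concretely, opting into the server collector is a one-element app.config switch (the surrounding file is a minimal sketch; only the `<gcServer>` element is the documented setting):

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Use the server GC (the parallel, stop-the-world collector)
         instead of the default workstation GC, even on a client OS. -->
    <gcServer enabled="true" />
  </runtime>
</configuration>
```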

苍暮颜 2024-09-21 14:49:56


I can hazard a couple of guesses as to what is happening.

(1) If you have a single thread and there is M space free in generation 0, then the GC will only run once M bytes have been allocated.

(2) If you have N threads and the GC divides up generation 0 into M/N space per thread, the GC will end up running every time a thread allocates M/N bytes - roughly N times as often. The showstopper here is that the GC needs to "stop the world" (i.e., suspend all running threads) in order to mark references from the threads' root sets. This is not cheap. So, not only will the GC run more often, it will also be doing more work on each collection.

The other problem, of course, is that multi-threaded applications aren't typically very cache friendly, which can also put a significant dent in your performance.

I don't think this is a .NET GC issue, rather it's an issue with GC in general. A colleague once ran a simple "ping pong" benchmark sending simple integer messages between two threads using SOAP. The benchmark ran twice as fast when the two threads were in separate processes because memory allocation and management was completely decoupled!
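The "GC runs more often when many threads allocate" point can be observed directly. A small sketch (again on the JVM, purely so there is something runnable; the counts are collector-dependent and the names are mine) counts collections while several threads allocate and immediately discard:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCount {
    static volatile byte[] sink; // occasional escape keeps the JIT from
                                 // optimizing the allocations away entirely

    static long collections() {
        long n = 0;
        for (GarbageCollectorMXBean b : ManagementFactory.getGarbageCollectorMXBeans())
            n += Math.max(0, b.getCollectionCount()); // -1 means "unavailable"
        return n;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = collections();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 5_000_000; j++) {
                    byte[] tmp = new byte[128];        // allocated and dropped
                    if ((j & 0xFFFFF) == 0) sink = tmp;
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // ~2.5 GB of short-lived allocation: the young generation fills and
        // is collected repeatedly even though nothing is retained.
        System.out.println("collections while allocating: " + (collections() - before));
    }
}
```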

偏闹i 2024-09-21 14:49:56


Very quick, easy-to-see (straight at a root, assigning nulls) and massive releases can trick the GC into being eager, and the whole idea of a cache-local heap is a nice dream :-) Even if you had fully separated thread-local heaps (which you don't), the handle-pointer table would still have to be fully volatile just to make it safe for general multi-CPU scenarios. Oh, and remember that there are many threads, the CPU cache is shared, and kernel needs take precedence, so it's not all just for you :-)

Also beware that a "heap" with double pointers has two parts - the block of memory to give out and the handle-pointer table (so that blocks can be moved but your code always has one address). Such a table is a critical but very lean process-level resource, and just about the only way to stress it is to flood it with massive quick releases - so you managed to do it :-))

In general the rule of GC is - leak :-) Not forever of course, but kind of for as long as you can. Remember how people go around telling you "don't force GC collections"? That's part of the story. Also, the "stop the world" collection is actually much more efficient than the "concurrent" one, and used to be known by the nicer names of cycle stealing or scheduler cooperation. Only the mark phase needs to freeze the scheduler, and on a server there's a burst of several threads doing it (N cores are idle anyway :-) The only reason for the other one is that it can make real-time operations like playing video jittery, just as a longer thread quantum does.

So again, if you go competing with the infrastructure on short and frequent CPU bursts (small alloc, almost no work, quick release), the only thing you'll see/measure will be GC and JIT noise.

If this was for something real, i.e. not just experimenting, the best you can do is to use big value arrays on the stack (structs). They can't be forced onto the heap, are as local as a local can get, and are not subject to any backdoor moving => the cache has to love them :-) That may mean switching to "unsafe" mode, using normal pointers and maybe doing a bit of allocation on your own (if you need something simple like lists), but that's a small price to pay for kicking the GC out :-) Trying to force data into the cache also depends on keeping your stacks lean otherwise - remember that you are not alone. Also, giving your threads some work that's worth at least several quantums between releases may help. The worst-case scenario would be if you alloc and release within a single quantum.

难忘№最初的完美 2024-09-21 14:49:56


or explain this discrepancy between my predictions and observations?

Benchmarking is hard.
Benchmarking a subsystem that is not under your full control is even harder.
