Understanding VS2010 C# parallel profiling results
I have a program with many independent computations so I decided to parallelize it.
I use Parallel.For/Each.
The results were okay for a dual-core machine - CPU utilization of about 80%-90% most of the time.
However, with a dual Xeon machine (i.e. 8 cores) I get only about 30%-40% CPU utilization, although the program spends quite a lot of time (sometimes more than 10 seconds) on the parallel sections, and I see it employs about 20-30 more threads in those sections compared to serial sections. Each thread takes more than 1 second to complete, so I see no reason for them to not work in parallel - unless there is a synchronization problem.
I used the built-in profiler of VS2010, and the results are strange.
Even though I use locks only in one place, the profiler reports that about 85% of the program's time is spent on synchronization (also 5-7% sleep, 5-7% execution, under 1% IO).
The locked code is only a cache (a dictionary) get/add:
    bool esn_found;
    lock (lock_load_esn)
        esn_found = cache.TryGetValue(st, out esn);
    if (!esn_found)
    {
        esn = pData.esa_inv_idx.esa[term_idx];
        esn.populate(pData.esa_inv_idx.datafile);
        lock (lock_load_esn)
        {
            if (!cache.ContainsKey(st))
                cache.Add(st, esn);
        }
    }
lock_load_esn is a static member of the class, of type Object. esn.populate reads from a file using a separate StreamReader for each thread.
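For comparison, here is a minimal sketch of how the same get-or-load cache could be written without an explicit lock, using ConcurrentDictionary and Lazy<T> (both available from .NET 4). EsnNode, EsnCache and the Populate stub are hypothetical stand-ins for the types in the snippet above, not the real code. Note that this only simplifies the locking; it does nothing about allocation pressure.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the question's esn type.
public sealed class EsnNode
{
    public static int PopulateCalls; // counts how often the "file read" ran

    public string Key { get; private set; }

    public EsnNode(string key) { Key = key; }

    // Placeholder for the real populate(datafile) call.
    public void Populate() { Interlocked.Increment(ref PopulateCalls); }
}

public static class EsnCache
{
    // Lazy<T> (thread-safe by default) makes the expensive load run at most
    // once per key, even if several threads miss on the same key together.
    private static readonly ConcurrentDictionary<string, Lazy<EsnNode>> cache =
        new ConcurrentDictionary<string, Lazy<EsnNode>>();

    public static EsnNode Get(string st)
    {
        Lazy<EsnNode> entry = cache.GetOrAdd(st, key => new Lazy<EsnNode>(() =>
        {
            var node = new EsnNode(key);
            node.Populate();
            return node;
        }));
        return entry.Value;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        // 1000 lookups over 10 distinct keys from many threads.
        Parallel.For(0, 1000, i => { EsnCache.Get("term" + (i % 10)); });
        Console.WriteLine("Populate calls: " + EsnNode.PopulateCalls);
    }
}
```

Unlike the original snippet, two threads that miss on the same key do not both run the load; whether that matters depends on how expensive populate is.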
However, when I press the Synchronization button to see what causes the most delay, I see that the profiler reports lines which are function entrance lines, and doesn't report the locked sections themselves.
It doesn't even report the function that contains the above code (reminder - the only lock in the program) as part of the blocking profile with noise level 2%. With noise level at 0% it reports all the functions of the program, and I don't understand why they count as blocking synchronization.
So my question is - what is going on here?
How can it be that 85% of the time is spent on synchronization?
How do I find out what really is the problem with the parallel sections of my program?
Thanks.
Update: After drilling down into the threads (using the extremely useful visualizer) I found out that most of the synchronization time was spent waiting for the GC thread to complete memory allocations, and that frequent allocations were needed because of resize operations on generic data structures.
I'll have to see how to initialize my data structures so that they allocate enough memory on initialization, possibly avoiding this race for the GC thread.
I'll report the results later today.
Update: It appears memory allocations were indeed the cause of the problem. When I used initial capacities for all Dictionaries and Lists in the parallel executed class, the synchronization problem was smaller. I now had only about 80% Synchronization time, with spikes of 70% CPU utilization (previous spikes were only about 40%).
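To illustrate what using initial capacities means here, a minimal sketch follows. The class name, method names and the expectedCount estimate are hypothetical; the point is only that the backing storage is allocated once instead of being regrown and copied as items are added.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityDemo
{
    // expectedCount is an assumed estimate of how many items will be stored.
    public static List<int> BuildPositions(int expectedCount)
    {
        // One backing array of the requested size, allocated up front.
        var positions = new List<int>(expectedCount);
        for (int i = 0; i < expectedCount; i++)
            positions.Add(i);
        return positions;
    }

    public static Dictionary<int, int> BuildCounts(int expectedCount)
    {
        // The dictionary sizes its internal tables for at least this many entries.
        var counts = new Dictionary<int, int>(expectedCount);
        for (int i = 0; i < expectedCount; i++)
            counts[i] = 1;
        return counts;
    }

    public static void Main()
    {
        List<int> positions = BuildPositions(1000);
        Dictionary<int, int> counts = BuildCounts(1000);
        // Capacity still equals the requested size, so no resize happened.
        Console.WriteLine(positions.Count + " items, capacity " + positions.Capacity);
        Console.WriteLine(counts.Count + " entries");
    }
}
```

If the estimate is too low the collections still grow as usual, so a rough upper bound is enough.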
I drilled even further into each thread and discovered that now many calls to GC allocate were made for allocating small objects which were not part of the large dictionaries.
I solved this issue by providing each thread with a pool of such preallocated objects, which I use instead of calling "new".
So I essentially implemented a separate pool of memory for each thread, but in a very crude way, which is very time-consuming and actually not very good. I still have to use a lot of new calls to initialize these objects, only now I do it once globally, and there is less contention on the GC thread, even when the pool has to grow.
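A rough sketch of what such a per-thread pool can look like, using ThreadLocal<T> from .NET 4, is shown below. Scratch, ScratchPool and the initial size of 64 are hypothetical names and values chosen for illustration; this is not the code from my program, and it does not give each thread its own heap, it only recycles objects.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical small object that would otherwise be allocated very often.
public sealed class Scratch
{
    public int Value;
    public void Reset() { Value = 0; }
}

public static class ScratchPool
{
    private const int InitialSize = 64; // assumed starting size per thread

    // Each thread lazily gets its own stack, so Rent/Return need no locking.
    private static readonly ThreadLocal<Stack<Scratch>> pool =
        new ThreadLocal<Stack<Scratch>>(() =>
        {
            var stack = new Stack<Scratch>(InitialSize);
            for (int i = 0; i < InitialSize; i++)
                stack.Push(new Scratch());
            return stack;
        });

    public static Scratch Rent()
    {
        Stack<Scratch> stack = pool.Value;
        // Fall back to a fresh allocation if this thread's pool is empty.
        return stack.Count > 0 ? stack.Pop() : new Scratch();
    }

    public static void Return(Scratch item)
    {
        item.Reset();
        pool.Value.Push(item); // goes back to the calling thread's pool
    }
}

public static class PoolDemo
{
    public static void Main()
    {
        long total = 0;
        Parallel.For(0, 100000, i =>
        {
            Scratch s = ScratchPool.Rent();
            s.Value = i;
            Interlocked.Add(ref total, s.Value);
            ScratchPool.Return(s); // rented and returned on the same thread
        });
        Console.WriteLine(total);
    }
}
```

The sketch assumes every object is returned on the thread that rented it; objects handed across threads would simply migrate between pools.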
But this is definitely not a solution I like as it is not generalized easily and I wouldn't like to write my own memory manager.
Is there a way to tell .NET to allocate a predefined amount of memory for each thread, and then take all memory allocations from the local pool?
Comments (1)
Can you allocate less?
I've had a couple similar experiences, looking at bad perf and discovering the heart of the issue was the GC. In each case, though, I discovered that I was accidentally hemorrhaging memory in some inner loop, allocating tons of temporary objects needlessly. I'd give the code a careful look and see if there are allocations you can remove. I think it's rare for programs to 'need' to allocate heavily in inner loops.
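As one hedged illustration of removing inner-loop allocations, the sketch below uses the Parallel.For overload with thread-local state, so a scratch buffer is created once per task and reused across iterations instead of being allocated on every pass. The digit-sum workload and all names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class AllocateLessDemo
{
    // Per-task state: a reusable buffer plus a partial result.
    private sealed class LocalState
    {
        public List<int> Digits;
        public long Sum;
    }

    // Hypothetical workload: sum the decimal digits of every number below count.
    public static long SumOfDigitSums(int count)
    {
        long grandTotal = 0;
        Parallel.For<LocalState>(
            0, count,
            // localInit runs once per task, not once per iteration.
            () => new LocalState { Digits = new List<int>(16) },
            (int i, ParallelLoopState loopState, LocalState local) =>
            {
                local.Digits.Clear(); // reuse instead of new List<int>()
                for (int n = i; n > 0; n /= 10)
                    local.Digits.Add(n % 10);
                foreach (int d in local.Digits)
                    local.Sum += d;
                return local;
            },
            // localFinally merges each task's partial sum exactly once.
            local => { Interlocked.Add(ref grandTotal, local.Sum); });
        return grandTotal;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfDigitSums(1000));
    }
}
```

The same pattern applies to any temporary list, array or StringBuilder that an iteration fills and then discards.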