What could cause the performance improvement? Time in GC, pooling
Our multithreaded application runs a lengthy computational loop. On average it takes about 29 seconds to finish one full cycle. During that time, the .NET performance counter "% Time in GC" measures 8.5%, and it is all Gen 2 collections.
To improve performance, we implemented a pool for our large objects and achieved a 100% reuse rate. The overall cycle now takes only 20 seconds on average, and "% Time in GC" shows something between 0.3% and 0.5%. The GC now performs only Gen 0 collections.
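For illustration, here is a minimal sketch of the pooling pattern we used (not our actual code; the element type, sizes, and the choice of ConcurrentBag are just assumptions for the example):

    // Minimal sketch of the pooling pattern (illustrative, not our production code).
    // Large arrays are rented and returned instead of being allocated every cycle,
    // so they stay alive in Gen 2 / LOH and never become garbage.
    using System.Collections.Concurrent;

    sealed class LargeBufferPool
    {
        private readonly ConcurrentBag<double[]> _items = new ConcurrentBag<double[]>();
        private readonly int _length;

        public LargeBufferPool(int length)
        {
            _length = length;
        }

        public double[] Rent()
        {
            double[] buffer;
            return _items.TryTake(out buffer) ? buffer : new double[_length];
        }

        public void Return(double[] buffer)
        {
            _items.Add(buffer); // keep the instance alive for the next cycle
        }
    }

Each worker rents its buffers at the start of a cycle and returns them at the end, so in steady state nothing new is allocated.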
Let's assume the pooling is implemented efficiently and neglect the additional time it takes to execute. That gives us a performance improvement of roughly 33 percent. How does that relate to the former "% Time in GC" value of 8.5%?
I have some assumptions, which I hope can be confirmed, adjusted and amended:
1) The "time in GC" (if I read it right) does measure the relation of 2 time spans:
- Time between 2 GC cycles and
- Time used for the last full GC cycle, this value is included into the first span.
What is not included in the second time span is the overhead of stopping and restarting the worker threads for the blocking GC. But how could that be as large as 20% of the overall execution time?
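To show where that rough 20% comes from, here is the back-of-the-envelope arithmetic, using only the numbers already stated above:

    using System;

    class GcArithmetic
    {
        static void Main()
        {
            double oldCycle = 29.0;                 // seconds per cycle without pooling
            double newCycle = 20.0;                 // seconds per cycle with pooling
            double timeInGc = 0.085 * oldCycle;     // 8.5% "% Time in GC"  -> ~2.5 s
            double saved = oldCycle - newCycle;     // observed saving       -> 9.0 s
            double unexplained = saved - timeInGc;  // -> ~6.5 s, i.e. roughly 20% of the old cycle
            Console.WriteLine("GC time: {0:F1} s, saved: {1:F1} s, unexplained: {2:P0} of the old cycle",
                timeInGc, saved, unexplained / oldCycle);
        }
    }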
2) Frequently blocking the threads for GC may introduce contention between the threads? It is just a thought; I could not confirm it via the VS concurrency profiler.
3) In contrast, I could confirm that the number of page misses (performance counter: Memory -> Page Faults/sec) is significantly higher for the unpooled application (25,000 per second) than for the application with the low GC rate (200 per second). I could imagine this also contributes to the large improvement. But what could explain that behaviour? Is it because frequent allocations cause a much larger area of the virtual address space to be used, which is therefore harder to keep in physical memory? And how could I measure this to confirm it as the cause here?
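Regarding how to measure it, this is a sketch of what I could sample while the loop runs (the "Process" and ".NET CLR Memory" counter categories are the standard ones; the instance name assumes a single instance of the process, and the sampling interval is arbitrary):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class CounterSampler
    {
        static void Main()
        {
            // For CLR counters the instance name is normally the process name
            // (with a #n suffix if several instances of the exe are running).
            string instance = Process.GetCurrentProcess().ProcessName;

            var pageFaults = new PerformanceCounter("Process", "Page Faults/sec", instance);
            var workingSet = new PerformanceCounter("Process", "Working Set", instance);
            var timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance);

            for (int i = 0; i < 30; i++) // sample once per second for 30 s
            {
                Thread.Sleep(1000);
                Console.WriteLine("faults/s: {0,8:F0}   working set: {1,6:F0} MB   % time in GC: {2,5:F1}",
                    pageFaults.NextValue(),
                    workingSet.NextValue() / (1024.0 * 1024.0),
                    timeInGc.NextValue());
            }
        }
    }

Comparing the working set of the pooled and unpooled runs over a whole cycle should show whether the unpooled version really touches a much larger range of memory.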
BTW: GCSettings.IsServerGC = false, .NET 4.0, 64-bit, running on Win7, 4 GB, Intel i5. (And sorry for the large question.. ;)
2 Answers
By pooling, you're also saving the time spent in new, which can be considerable, but I wouldn't spend time trying to balance the numbers. Rather than "look a gift horse in the mouth", why not move on to finding other "bottlenecks"?
When you remove one performance problem, you make others take a larger percentage of the time, because the denominator is smaller.
So they are easier to find, provided you know how to look for them.
Here's an example, and a method.
You clean out one big problem.
That makes the next one bigger, by percent, so you clean that one out.
Rinse and repeat.
It may get to take so little time that you need to wrap a temporary outer loop around it, just to make it take long enough to investigate.
You keep going this way, progressively making the program take less and less time, until you hit diminishing returns.
That's how to make the code fast.
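As a concrete example of the "temporary outer loop" idea (RunOneCycle is a stand-in for whatever drives one pass of your computation):

    // Throwaway harness: repeat the now-fast cycle enough times that a
    // profiler (or manual pausing) has something substantial to sample.
    static void ProfileHarness()
    {
        for (int i = 0; i < 100; i++)
        {
            RunOneCycle(); // hypothetical entry point for one computation cycle
        }
    }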
Pre-allocating the objects improves concurrency; the threads no longer have to enter the global lock that protects the garbage-collected heap to allocate an object. The lock is held for a very short time, but clearly you were allocating a lot of objects, so it isn't unlikely that threads were fighting for the lock.
The "% Time in GC" performance counter measures the percentage of CPU time spent collecting instead of executing regular code. You can get a big number if there are a lot of gen #2 collections and the rate at which you allocate objects is so high that background collection can no longer keep up and the threads must be blocked. Having more threads makes that worse, since you can allocate more.
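If you want to see that collection pressure directly rather than through the counter, one cheap check (a sketch; wrap it around one cycle of your loop) is to diff GC.CollectionCount per generation before and after a cycle:

    using System;

    static class GcStats
    {
        // Run one cycle of the computation and report how many collections
        // of each generation it triggered.
        public static void Measure(Action oneCycle)
        {
            int g0 = GC.CollectionCount(0);
            int g1 = GC.CollectionCount(1);
            int g2 = GC.CollectionCount(2);

            oneCycle();

            Console.WriteLine("Gen0: {0}  Gen1: {1}  Gen2: {2}",
                GC.CollectionCount(0) - g0,
                GC.CollectionCount(1) - g1,
                GC.CollectionCount(2) - g2);
        }
    }

With the pool in place you should see the Gen 2 count stay at zero across a cycle, matching what the counter already tells you.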