My (huge) application throws an OutOfMemoryException, now what?

Posted 2024-08-09 13:49:49

This is by far the most complex software I've built and now it seems to be running out of memory at some point. I haven't done extensive testing yet, because I'm a bit lost how I should approach the problem at hand.

HandleCount: 277
NonpagedSystemMemorySize: 48136
PagedMemorySize: 1898590208
PagedSystemMemorySize: 189036
PeakPagedMemorySize: 1938321408
VirtualMemorySize: 2016473088
PeakVirtualMemory: 2053062656
WorkingSet: 177774592
PeakWorkingSet: 883834880
PrivateMemorySize: 1898590208
PrivilegedProcessorTime: 00:00:15.8593750
UserProcessorTime: 00:00:01.6562500
TotalProcessorTime: 00:00:17.5156250
GDI Objects: 30
User Objects: 27

I have an automated global exception catcher that, upon an exception, gathers the above information (using System.Diagnostics.Process), along with the exception details, the log and a screenshot, and e-mails everything to me.
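For illustration only, here is a minimal sketch of the kind of collector described above, using only documented System.Diagnostics.Process members (the class name, the property selection and the hypothetical EmailReport hook-up are assumptions, not the author's actual code):

    using System;
    using System.Diagnostics;
    using System.Text;

    static class CrashReport
    {
        // Snapshot the current process's memory and CPU counters as a plain string,
        // ready to be appended to an exception e-mail.
        public static string GatherProcessInfo()
        {
            using (Process p = Process.GetCurrentProcess())
            {
                StringBuilder sb = new StringBuilder();
                sb.AppendLine("HandleCount: " + p.HandleCount);
                sb.AppendLine("PagedMemorySize: " + p.PagedMemorySize64);
                sb.AppendLine("PeakPagedMemorySize: " + p.PeakPagedMemorySize64);
                sb.AppendLine("VirtualMemorySize: " + p.VirtualMemorySize64);
                sb.AppendLine("PeakVirtualMemorySize: " + p.PeakVirtualMemorySize64);
                sb.AppendLine("WorkingSet: " + p.WorkingSet64);
                sb.AppendLine("PeakWorkingSet: " + p.PeakWorkingSet64);
                sb.AppendLine("PrivateMemorySize: " + p.PrivateMemorySize64);
                sb.AppendLine("TotalProcessorTime: " + p.TotalProcessorTime);
                return sb.ToString();
            }
        }
    }

    // Wired up from a global handler, for example:
    // AppDomain.CurrentDomain.UnhandledException +=
    //     (sender, e) => EmailReport(CrashReport.GatherProcessInfo(), e.ExceptionObject);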

This has been working nicely, as I've been able to plug bugs based on the e-mailed information. That is, up until now. The software is tens of thousands of lines and uses both managed and unmanaged resources.

I could start going through the code line by line, but somehow I get the feeling this might not be the best way to track down the memory build-up problem.

As I've never done this kind of analysis before, how would you suggest approaching this kind of problem?

Comments (8)

卸妝后依然美 2024-08-16 13:49:49

We provide a tool for that.

http://msdn.microsoft.com/en-us/library/ms979205.aspx

CLR Profiler enables you to look at the managed heap of a process and investigate the behavior of the garbage collector. Using the various views in the tool, you can obtain useful information about the execution, allocation, and memory consumption of your application.

Using CLR Profiler, you can identify code that allocates too much memory, causes too many garbage collections, and holds on to memory for too long.

猥︴琐丶欲为 2024-08-16 13:49:49

There are a couple of options. Dedicated memory profilers such as ANTS Memory Profiler from RedGate can be very useful for troubleshooting this kind of problem.

If you don't want to spend money on a dedicated tool, you can also use WinDbg (part of Debugging tools for Windows, a free download from Microsoft). It can show you heap usage for the managed heap, the various AppDomain heaps and so forth.

Have a look at this blog for hints on using WinDbg.
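For orientation, a typical SOS session for this kind of investigation looks roughly like the following (these are standard WinDbg/SOS commands, run one at a time; <MT> and <address> are placeholders you fill in from earlier output, and the module to load from depends on the runtime version):

    .loadby sos mscorwks     (load SOS for .NET 2.0/3.5; use ".loadby sos clr" on .NET 4+)
    !eeheap -gc              (managed GC heap sizes, per generation and segment)
    !dumpheap -stat          (live objects grouped by type, with counts and total sizes)
    !dumpheap -mt <MT>       (list the individual instances of one suspicious type)
    !gcroot <address>        (show the chain of references keeping a given instance alive)

Run against a dump taken near the memory peak (or a live attach), the !dumpheap -stat output usually points straight at the type that is accumulating.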

Keep in mind that troubleshooting out-of-memory conditions can be hard, as you usually don't see the actual problem but merely a symptom. So unlike a crash, where the call stack gives you a pretty good indication of the source of the problem, the call stack of a process that hits OOM may reveal very little.

In my experience you have to look at where memory is used. It could be on the managed heap, in which case you have to find out if something is holding on to instances longer than necessary. However, it could also be related to loading lots of assemblies (typically assemblies generated on the fly).
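For what it's worth, a concrete example of the "assemblies generated on the fly" case: most XmlSerializer constructor overloads emit a new temporary assembly on every call, and assemblies are never unloaded from an AppDomain, so constructing serializers repeatedly grows memory until OOM. The Order type below is made up for illustration:

    using System.Xml.Serialization;

    public class Order { public int Id; }

    static class Serializers
    {
        // Only the XmlSerializer(Type) and XmlSerializer(Type, string defaultNamespace)
        // constructors cache their generated assemblies. An overload like the one below,
        // if called inside a loop or per request, leaks one dynamic assembly per call:
        //
        //     var s = new XmlSerializer(typeof(Order), new XmlRootAttribute("order"));
        //
        // Creating it once and reusing it avoids the build-up:
        public static readonly XmlSerializer OrderSerializer =
            new XmlSerializer(typeof(Order), new XmlRootAttribute("order"));
    }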

ゞ花落谁相伴 2024-08-16 13:49:49

Take a look at this MSDN article about detecting memory leaks in .NET applications.

Perhaps you have some issues where memory is getting allocated and never collected.
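One of the most common ways that happens in practice is an event subscription that is never removed: the long-lived publisher's delegate list keeps every subscriber reachable, so the GC can never collect them. A minimal sketch, with made-up type names:

    using System;

    class DataFeed
    {
        // Static event: the delegate list lives for the whole process.
        public static event EventHandler Updated;

        public static void RaiseUpdated()
        {
            EventHandler handler = Updated;
            if (handler != null) handler(null, EventArgs.Empty);
        }
    }

    class ChartWindow : IDisposable
    {
        public ChartWindow()
        {
            DataFeed.Updated += OnUpdated;   // subscribes this window to the static event
        }

        void OnUpdated(object sender, EventArgs e) { /* redraw */ }

        // Without this unsubscribe, every ChartWindow ever opened stays reachable
        // from DataFeed.Updated and is never collected, even after it is closed.
        public void Dispose()
        {
            DataFeed.Updated -= OnUpdated;
        }
    }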

妞丶爷亲个 2024-08-16 13:49:49

Attach a debugger to it and reproduce the error. The call stack at exception time should tell you where the error is.

Either you have one or more memory leaks, you're not disposing your objects, or you need better hardware :)
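To make the "not disposing your objects" point concrete, here is a small hedged example with GDI+ objects, which wrap unmanaged handles and native bitmap memory (the method and paths are made up):

    using System.Drawing;

    static class Thumbnails
    {
        static void SaveThumbnail(string inputPath, string outputPath)
        {
            // Without the using blocks, the unmanaged bitmap memory and GDI handles
            // would linger until a finalizer eventually runs.
            using (Bitmap original = new Bitmap(inputPath))
            using (Bitmap thumb = new Bitmap(original, new Size(120, 90)))
            {
                thumb.Save(outputPath);
            }   // Dispose() releases both bitmaps deterministically here
        }
    }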

小梨窩很甜 2024-08-16 13:49:49

I have exactly the same application. :) Our application used to take up to 10 GB of RAM. This is obviously bad. After some optimization I managed to decrease memory usage roughly 50-fold, so the same data set now takes about 200 MB. Magic? No. :) What I did:

  1. Some data was stored in memory several times (several copies). I kept a single copy of each set of data.
  2. Some data was stored as string, but int is more efficient because those strings contained digits only.
  3. The main data storage class was Dictionary<uint,uint>. We wrote our own dictionary that does not store any hashes; as a result, memory usage decreased 3 times on 64-bit systems and 2 times on 32-bit systems.

So my question is: what is the main class/object you use to store data? What kind of data do you store?
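Not the answerer's actual class, but one way to get a Dictionary<uint,uint>-style lookup without per-entry hash and bucket overhead is a pair of parallel sorted arrays and a binary search, assuming the data is loaded once and then only read:

    using System;

    sealed class CompactUIntMap
    {
        private readonly uint[] keys;    // sorted ascending
        private readonly uint[] values;  // values[i] belongs to keys[i]

        public CompactUIntMap(uint[] sortedKeys, uint[] values)
        {
            this.keys = sortedKeys;
            this.values = values;
        }

        public bool TryGetValue(uint key, out uint value)
        {
            int index = Array.BinarySearch(keys, key);
            if (index >= 0)
            {
                value = values[index];
                return true;
            }
            value = 0;
            return false;
        }
    }

At roughly 8 bytes per entry, versus the 20 or so bytes a Dictionary<uint,uint> entry costs (hash code, next index, bucket slot, key, value), this is in the same ballpark as the 2-3x reduction described above.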

春风十里 2024-08-16 13:49:49

Your PeakWorkingSet is right around the number where 32-bit CLR processes commonly start to bomb out.

Despite what people tell you, and despite the huge irony of automatic memory management, you have to be aware of this and make sure you never approach that limit on 32-bit systems. Many are unaware of it, and I usually love picking up their C# bloat downvotes, but when you run a few such apps on a single desktop you can expect some havoc. Just look at the managed portion of a VS shutdown; it's like a train running through the PC.

There is a free MemProfiler for .NET; use it and look for the hanging roots. Eventually, and especially as you start dealing with moderately sized data, you will have to design for streaming rather than rely on it all fitting on an x64 box with more RAM.
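A sketch of what "design for streaming" can mean in practice: expose the data as a lazy sequence so only one record is in memory at a time, instead of materializing the whole set (the file format and method names here are made up):

    using System.Collections.Generic;
    using System.IO;

    static class RecordReader
    {
        // Yields one parsed record at a time; nothing forces the whole file into memory.
        public static IEnumerable<string[]> ReadRecords(string path)
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    yield return line.Split(';');
                }
            }
        }
    }

    // Consumers aggregate as they go instead of holding everything:
    //
    //     foreach (string[] record in RecordReader.ReadRecords("data.csv"))
    //         Process(record);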

And a data set of roughly 880 MB is pathetically small these days... FACT!

[Piece to C# 3.0 sheep ]

Perhaps you should first check the places where you use unmanaged resources. The problem might be that you don't release them, or that you don't do it correctly.
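If any of those unmanaged resources are raw allocations or handles, a hedged sketch of the standard way to make sure they are always released is below: Dispose() frees the memory deterministically and the finalizer is only a safety net (for OS handles, deriving from SafeHandle is the usual alternative):

    using System;
    using System.Runtime.InteropServices;

    sealed class NativeBuffer : IDisposable
    {
        private IntPtr buffer;

        public NativeBuffer(int bytes)
        {
            buffer = Marshal.AllocHGlobal(bytes);   // unmanaged allocation the GC knows nothing about
        }

        public void Dispose()
        {
            Free();
            GC.SuppressFinalize(this);
        }

        ~NativeBuffer()
        {
            Free();   // safety net if Dispose() was never called
        }

        private void Free()
        {
            if (buffer != IntPtr.Zero)
            {
                Marshal.FreeHGlobal(buffer);
                buffer = IntPtr.Zero;
            }
        }
    }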

静赏你的温柔 2024-08-16 13:49:49

A lot of useful solutions have already been suggested and the MSDN article is very thorough. In conjunction with the suggestions above, I would also do the following:

Correlate the time of the exception with your log file to see what was going on at the time of the OOM exception. If you have little logging at info or debug level I would suggest adding some logging so you have an idea of the context around this error.

Does the memory usage gradually increase over a long period of time before the exception (e.g. a server process that runs indefinitely), or does it jump up in large increments quite quickly until the exception? Are lots of threads running or just one?

If the former is true and the exception doesn't occur for a long time, it would imply that resources are leaking, as stated above. If the latter is true, a number of things could contribute to the cause, e.g. a loop that allocates a lot of memory per iteration, or receiving a very large set of results from a service.

Either way the log file should provide you with enough information on where to start. From there I would make sure I could recreate the error, either by issuing a certain set of commands in the interface or by using a consistent set of inputs. After that, depending on the state of the code, I would try (with the use of the log file info) to create some integration tests that target the assumed source of the problem. This should allow you to recreate the error condition much faster and make it a lot easier to find, as the code you are concentrating on will be a lot smaller.

Another thing I tend to do is surround memory-sensitive code with a small profiling class. This can log memory usage to the log file and give you immediate visibility of problems in the log. The class can be written so it's not compiled into release builds, or so it carries only a tiny performance overhead (if you need more info, contact me). This type of approach doesn't work well when lots of threads are allocating at once.
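For what it's worth, a minimal sketch of such a profiling class (the class name and the Trace-based logging are assumptions; plug in whatever logger the application already uses):

    using System;
    using System.Diagnostics;

    sealed class MemoryProbe : IDisposable
    {
        private readonly string label;
        private readonly long before;

        public MemoryProbe(string label)
        {
            this.label = label;
            this.before = GC.GetTotalMemory(false);   // managed heap size at entry
        }

        public void Dispose()
        {
            long after = GC.GetTotalMemory(false);
            Trace.WriteLine(label + ": managed heap grew by " + (after - before) + " bytes");
        }
    }

    // Usage around a memory-sensitive block (LoadCustomerGrid is a made-up example):
    //
    //     using (new MemoryProbe("LoadCustomerGrid"))
    //     {
    //         LoadCustomerGrid();
    //     }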

You mentioned unmanaged resources; I assume all the code you / your team have written is managed? If not, and if possible, I would surround the unmanaged boundaries with a profiling class similar to the one mentioned above, to rule out leaks from unmanaged code or interop. Pinning lots of objects for unmanaged calls can also cause heap fragmentation, but if you have no unmanaged code both of these points can be ignored.

Explicitly calling the garbage collector was discouraged in an earlier comment. Although you should rarely do this, there are times when it is valid (search Rico Mariani's blog for examples). One example (covered in the blog mentioned) in which I have explicitly called collect is when large amounts of string data had been returned from a service, put into a dataset and then bound to a grid. Even after the screen was closed this memory wasn't collected for some time. In general it shouldn't be called explicitly, because the garbage collector maintains metrics on which it bases (among other things) its collections. Calling Collect explicitly invalidates these metrics.
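For those rare, justified cases, the sequence normally used is collect, let finalizers run, then collect again so that whatever the finalizers released is actually reclaimed; a short sketch:

    using System;

    static class GcHelper
    {
        // Only for the exceptional situations described above; do not sprinkle this around.
        public static void ForceFullCollect()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
    }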

Finally, it is generally good to have an idea of the memory requirements of your application. Obtain this by logging more information, occasionally running the profiler, and stress / unit / integration tests. Get an idea of what impact a certain operation will have at a high level, e.g. for a given set of inputs roughly x will be allocated. I gain an understanding of this by logging detailed information at strategic points in the log file. Just keep in mind that a bloated log file can be hard to understand or interpret.
