Process crashes suddenly with no errors
I have a somewhat large server process, written in .NET 3.5 and running in a VMware vCenter Server environment, that keeps crashing without any errors being reported. The process is created by a Windows Service on 32-bit Windows Server 2003 and is intended to be long-running (multiple days). It is a collaboration process that accepts connections via TCP sockets from multiple clients running on other Windows XP machines and allows them to share data. In addition, the process self-hosts about 8 WCF services that expose a mixture of TCP and HTTP endpoints. The process generally consumes about 500 MB of memory and 30-50% CPU at all times. There is also an instance of SQL Server 2005 on the same VM that hosts 6 databases and consumes about 1-1.2 GB of memory. The entire system has been allocated 8 GB of RAM and consumes as much as 7 GB during normal operation. I assume PAE is enabled to allow the system to address 8 GB of RAM, but I have not confirmed this.
The problem is that, at seemingly random times, the process suddenly crashes with no errors being reported anywhere, including the event log. I've tried attaching debuggers to the process, and they have not caught the crash either. I first tried WinDbg on the release build with symbols loaded, then I replaced all of the release DLLs/EXEs with debug builds and loaded their symbols. The crashes still occurred, and the debugger did not catch them. I next installed Visual Studio on the system with the .NET Reflector add-in and attached that. It did not catch the crash either.
Before you lecture me on why we're running so many things on a single VM, know that I did not design the system, nor did I implement it this way. Our customer dictated it for specific reasons, and I've been asked to come in and make it work. I'm only interested in criticisms of the environment if you can cite specific evidence that would help explain the sudden crashes. Our customer may be willing to alter the environment if we can show such evidence. Any additional debugging techniques that would let me capture more information about the crash would be greatly appreciated as well.
Comments (3)
http://blogs.msdn.com/b/tess/archive/2009/03/20/debugging-a-net-crash-with-rules-in-debug-diag.aspx
A "crash" without output suggests a call to _exit() (or even exit()). I've seen a few corners of the Visual Studio runtime library do that, though they usually get a cryptic message out to stderr. Is stderr captured?

Running out of memory also seems like a plausible suspect. If .NET has a heapspace()-like function that reports how much memory the heap is using, log that periodically, perhaps along with the total memory used (code + stack + data). I'm not familiar with .NET, but there must be functions to get those values.
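For the periodic memory logging suggested above, a minimal C# sketch follows. It assumes the .NET calls GC.GetTotalMemory and the Process properties PrivateMemorySize64 and WorkingSet64 are close enough to the "heapspace()-like" values the answer has in mind; the class name MemoryLogger, its methods, and the log path are hypothetical and do not come from the original post.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration; not part of the original poster's code.
static class MemoryLogger
{
    // Keep a reference so the timer is not garbage collected while the process runs.
    private static Timer _timer;

    // Builds one log line; kept separate so the format can be checked in isolation.
    public static string FormatSample(DateTime utcNow, long managedBytes,
                                      long privateBytes, long workingSetBytes)
    {
        return string.Format("{0:o} managed={1} private={2} workingSet={3}",
                             utcNow, managedBytes, privateBytes, workingSetBytes);
    }

    // Appends a memory sample to logPath immediately and then once per interval.
    public static void Start(string logPath, TimeSpan interval)
    {
        _timer = new Timer(delegate
        {
            try
            {
                using (Process p = Process.GetCurrentProcess())
                {
                    string line = FormatSample(
                        DateTime.UtcNow,
                        GC.GetTotalMemory(false), // managed heap estimate, no forced collection
                        p.PrivateMemorySize64,    // private bytes committed to the process
                        p.WorkingSet64);          // physical memory currently in use
                    File.AppendAllText(logPath, line + Environment.NewLine);
                }
            }
            catch (Exception)
            {
                // Swallow logging failures: an unhandled exception on a
                // thread-pool thread would itself terminate the process.
            }
        }, null, TimeSpan.Zero, interval);
    }
}
```

Calling something like MemoryLogger.Start(@"C:\logs\memory.log", TimeSpan.FromMinutes(1)) at service startup (path assumed) would leave a trail showing whether memory was climbing before each crash. If the last samples before a crash look flat, memory exhaustion becomes a less likely explanation.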
It turns out that one of the service plugins was seeking out and referencing a Java library. When the user logged out, the JVM was terminated, and the plugin took the service down with it. We were able to get everything working again by following the suggestions in this post (starting the JVM with the '-Xrs' parameter):
http://www.velocityreviews.com/forums/t128371-java-app-dies-on-logoff.html