Causes of a crash during garbage collection
For some time now I've been struggling with a crash in a C# application that also uses a fair number of C++/CLI modules, most of which are wrappers around native libraries for accessing device drivers.
The crash is not always easy to reproduce, but I was able to collect half a dozen crash dumps, all of which show the program crashing with an access violation during a garbage collection. This is the native call stack and the last event:
0:000> k
ChildEBP RetAddr
0012d754 79f95a8f mscorwks!WKS::gc_heap::find_first_object+0x62
0012d7dc 79f933bb mscorwks!WKS::gc_heap::mark_through_cards_for_segments+0x493
0012d814 79f92cbf mscorwks!WKS::gc_heap::mark_phase+0xc3
0012d838 79f93245 mscorwks!WKS::gc_heap::gc1+0x62
0012d84c 79f92f5a mscorwks!WKS::gc_heap::garbage_collect+0x253
0012d878 79f94e26 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x1a9
0012d904 79f926ce mscorwks!WKS::gc_heap::try_allocate_more_space+0x15b
0012d918 79f92769 mscorwks!WKS::gc_heap::allocate_more_space+0x11
0012d938 79e73291 mscorwks!WKS::GCHeap::Alloc+0x3b
0:000> .lastevent
Last event: 7e8.88: Access violation - code c0000005 (first/second chance not available)
debugger time: Mon Sep 26 11:34:53.646 2011 (UTC + 2:00)
So let me first ask my question and give more details below. My question is: besides managed heap corruption, is there any other cause for a crash during garbage collection?
To elaborate a bit, the reason I ask is that I'm having a really hard time identifying the code that is corrupting the managed heap, and I can't seem to find a pattern in the memory that is (supposedly) overwritten.
I already tried commenting out all the "dangerous" C++/CLI code (especially the parts that use pinned handles), but this didn't help. To look for a pattern in the overwritten memory, I looked at the disassembled code at the point of the crash:
0:000> u .-a .+a
mscorwks!WKS::gc_heap::find_first_object+0x54:
79f935b9 89450c mov dword ptr [ebp+0Ch],eax
79f935bc 8bd0 mov edx,eax
79f935be 8b02 mov eax,dword ptr [edx]
79f935c0 83e0fe and eax,0FFFFFFFEh
79f935c3 f70000000080 test dword ptr [eax],80000000h <<<<CRASH
79f935c9 0f84b1000000 je mscorwks!WKS::gc_heap::find_first_object+0x73
0:000> r
eax=00000000 ebx=01c81000 ecx=01c80454 edx=01c82fe0 esi=012f0000 edi=000027e1
eip=79f935c3 esp=0012d738 ebp=0012d754 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
mscorwks!WKS::gc_heap::find_first_object+0x62:
79f935c3 f70000000080 test dword ptr [eax],80000000h ds:0023:00000000=????????
The crash happens when trying to dereference the EAX register, which is null. From what I can see, EAX was loaded from the contents pointed to by the EDX register, so I looked at the memory around the address stored there:
0:000> dd @edx-10
01c82fd0 06542778 00000000 00000000 01c82494
01c82fe0 00000000 00000000 00000000 00000000
01c82ff0 01b641d0 00000000 00000000 01c82380
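For readers following the disassembly, the instructions leading up to the fault can be paraphrased in C++. This is only a sketch under assumptions: it treats the word at `[edx]` as an object's header word holding a pointer to a type descriptor, with the lowest bit presumably borrowed by the GC as a flag. The names `HeaderWord`, `Probe`, `strip_low_bit` and `probe_type_flag` are invented for illustration and are not CLR identifiers. Unlike the real code, the sketch checks for null instead of faulting.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the first word of an object on the managed heap.
using HeaderWord = std::uintptr_t;

enum class Probe { NullType, FlagClear, FlagSet };

// Mirrors `and eax,0FFFFFFFEh`: clear the lowest bit of the loaded word.
constexpr HeaderWord strip_low_bit(HeaderWord header) {
    return header & ~static_cast<HeaderWord>(1);
}

// Mirrors `mov eax,[edx]` / `and` / `test dword ptr [eax],80000000h`,
// except that a null pointer is reported rather than dereferenced.
inline Probe probe_type_flag(const HeaderWord* object) {
    assert(object != nullptr);
    const HeaderWord type_ptr = strip_low_bit(*object);
    if (type_ptr == 0) {
        return Probe::NullType;  // the real code faults at this point
    }
    const auto* type_word = reinterpret_cast<const std::uint32_t*>(type_ptr);
    return (*type_word & 0x80000000u) != 0 ? Probe::FlagSet : Probe::FlagClear;
}
```

With a header word of zero, as at the address in EDX in the dump above, the function reports `Probe::NullType`; in mscorwks the same situation is the access violation at `find_first_object+0x62`.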
EDIT: I now see that my analysis was wrong; I lacked an understanding of x86 addressing modes.
So I can see that starting at address 01c82fe0 (the value stored in EDX) the next 16 bytes are null.
But when looking at another similar crash dump I see the following:
0:000> dd @edx-10
018defd4 00000000 00000000 00000000 00000000
018defe4 00000000 00000000 018b468c 01742354
018deff4 00e0907f 00000000 00000000 00000000
So here the 16 bytes before the address pointed to by EDX and the next 8 bytes from there are null. The same happens in the other crash dumps that I have. I don't see a pattern here, i.e. it doesn't look as if some piece of code is simply overwriting this region of memory.
Going back to the question, what I would like to know is whether there is some other explanation for the crash besides a piece of code overwriting memory that it shouldn't. Any advice at all on how to proceed is welcome too; I'm really lost on this one.
(Could the pinned handles cause a problem? We have quite a few of them, and what I find funny is that I always see exactly 137 pinned handles, no more, no less, with !gchandles at the point of the crash. That seems a strange coincidence to me.)
EDIT: I forgot to mention that we're using version 3.5 of the .NET Framework. I've seen reports of similar crashes in .NET 4 when the background GC is active (somewhere it is mentioned that this is a bug in .NET), but I don't think that is relevant here since, AFAIK, there is no background GC in .NET 3.5.
Comments (3)
Not sure if this helps, but in general don't use destructors or rely on the GC to handle unmanaged memory. Use the Dispose pattern instead, and move all destructor code to finalizers:
This will implement the IDisposable pattern on the object. Call Dispose to release unmanaged data, and at worst you'll have a better chance of figuring out exactly what happens.
So unfortunately my question was a bit misleading, since I was looking for alternative explanations besides managed heap corruption, which turned out to be the problem in the end (caused by an unsafe copy of an unmanaged struct to a managed one).
The problem is now solved and I'm posting my findings here in a separate answer; I hope that this is OK.
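The post does not show the offending copy, so the following is only a sketch of one way a struct copy can overrun its destination and how a size check can guard against it. The struct layouts and the helpers `copy_same_layout` and `copy_if_fits` are hypothetical, written in plain C++ rather than C++/CLI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical layouts: the native struct has a field its mirror lacks.
struct NativeStatus { int code; int flags; int extra; };
struct MirrorStatus { int code; int flags; };

// Compile-time guard: refuses to build when the two layouts differ in size.
template <typename Dst, typename Src>
void copy_same_layout(Dst& dst, const Src& src) {
    static_assert(sizeof(Dst) == sizeof(Src), "source and destination sizes differ");
    std::memcpy(&dst, &src, sizeof(Dst));
}

// Run-time guard for raw buffers: copies only when the source fits.
inline bool copy_if_fits(void* dst, std::size_t dst_size,
                         const void* src, std::size_t src_size) {
    assert(dst != nullptr && src != nullptr);
    if (src_size > dst_size) {
        return false;  // copying would write past the end of dst
    }
    std::memcpy(dst, src, src_size);
    return true;
}
```

An unchecked `memcpy` of `sizeof(NativeStatus)` bytes into a `MirrorStatus` would write one `int` past the destination; if the destination lives on the managed heap, the bytes that get clobbered may belong to a neighbouring object's header, which would be consistent with a fault that only shows up later, during a collection.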
You probably have an exception in one of your finalizers. I believe you need to check them one by one, because there is no room for errors in the finalization queue. If you don't have unmanaged code, it's better not to have a finalizer at all and just call Dispose manually.