How do I track down the cause of a StackOverflowException in .NET?
I get a StackOverflowException when I run the following code:
    private void MyButton_Click(object sender, EventArgs e) {
        MyButton_Click_Aux();
    }

    private static volatile int reportCount;

    private static void MyButton_Click_Aux() {
        try { /*remove because stack overflows without*/ }
        finally {
            var myLogData = new ArrayList();
            myLogData.Add(reportCount);
            myLogData.Add("method MyButtonClickAux");
            Log(myLogData);
        }
    }

    private static void Log(object logData) {
        // my logging code does not matter here
    }
What could be causing the StackOverflowException?
I know how to stop it from happening; I just don't know why it happens (yet). It would appear you have indeed found a bug, either in the .NET BCL or, more likely, in the JIT.
I commented out all the lines in the MyButton_Click_Aux method and then brought them back in, one by one.
Take the volatile off the static int and you'll no longer get a StackOverflowException.
Now to research why... Clearly something to do with memory barriers is causing an issue, perhaps somehow forcing the MyButton_Click_Aux method to call itself...
UPDATE
Okay, so other people are finding that .NET 3.5 is not affected.
I'm using .NET 4 as well, so these comments relate to that:
As I said, take the volatile off and it works.
Equally, if you put the volatile back on and remove the try/finally, it also works.
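A minimal sketch of the two variants described above, reconstructed from the question's code rather than taken from the answerer's own snippets. The class name, the method names and the returned item count are hypothetical and exist only to make the sketch self-contained; the list is created outside the try/finally so the count can be returned.

```csharp
using System;
using System.Collections;

public static class WorkaroundSketch
{
    private static int plainCount = 0;          // variant 1: volatile removed
    private static volatile int volatileCount;  // variant 2: volatile kept

    // Variant 1: try/finally kept, but the field is no longer volatile.
    public static int LogWithPlainField()
    {
        var myLogData = new ArrayList();
        try { }
        finally
        {
            myLogData.Add(plainCount);
            myLogData.Add("method MyButtonClickAux");
            Log(myLogData);
        }
        return myLogData.Count;
    }

    // Variant 2: field still volatile, but the try/finally is gone.
    public static int LogWithoutFinally()
    {
        var myLogData = new ArrayList();
        myLogData.Add(volatileCount);
        myLogData.Add("method MyButtonClickAux");
        Log(myLogData);
        return myLogData.Count;
    }

    // Placeholder for the question's Log method (assumed to do nothing relevant).
    private static void Log(object logData) { }
}
```

Both methods are expected to add two items and return normally; the only thing that differs between them and the failing code is the combination of volatile and finally.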
I also wondered whether it had something to do with reportCount being uninitialised when the try/finally is in place, but initialising it to zero makes no difference.
I'm looking at the IL now, although it might require someone with some ASM chops to get involved...
Final Update
As I say, understanding what is really happening will require analysis of the JIT output. Whilst I find it fun to analyse assembler, I feel it's probably a job for someone at Microsoft, so that this bug can actually be confirmed and fixed! That said, it appears to need a pretty narrow set of circumstances.
I've moved over to a release build to get rid of all the IL noise (nops etc) for analysis.
This has, however, had a complicating impact on the diagnosis. I thought I had it but didn't - but now I know what it is.
I tried this code:
With the int as volatile. It runs without fault. Here's the IL:
Then we look at the minimum code required to get the error again:
And its IL:
The difference? Well, there are two that I spotted: boxing of the volatile int, and a virtual call. So I set up these two classes:
And then tried each of these four versions of the offending method:
Which works.
Which also works.
Oops - StackOverflowException.
And:
Oops again! StackOverflowException!
We could go further with this, but the issue is, I feel, clearly caused by the boxing of the volatile int inside the finally block of a try/finally... I put the code inside the try, and no problem. I added a catch clause (and put the code in there), also no problem.
It could also apply to the boxing of other value types I guess.
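To make the boxing explicit, here is a small sketch (hypothetical class and method names) of where the box comes from. ArrayList.Add takes an object and is a virtual method, so passing an int boxes it; List<int>.Add takes an int directly, so no box is created. Whether the generic version avoids the JIT problem on an unpatched .NET 4.0 is an assumption that this thread does not establish.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class BoxingSketch
{
    private static volatile int reportCount;

    // Same shape as the failing code: the volatile int is boxed inside finally.
    public static object BoxInFinally()
    {
        var list = new ArrayList();
        try { }
        finally
        {
            list.Add(reportCount);   // int -> object: a box is allocated here
        }
        return list[0];
    }

    // Same shape, but the generic list stores the int without boxing it.
    public static int NoBoxInFinally()
    {
        var list = new List<int>();
        try { }
        finally
        {
            list.Add(reportCount);   // stays an int, no box
        }
        return list[0];
    }
}
```

Comparing the IL of the two methods should show a box instruction only in the first one, which is a quick way to check whether a given finally block matches the pattern described here.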
So, to summarise: in .NET 4.0, in both debug and release builds, the boxing of a volatile int in a finally block appears to cause the JIT to generate code that ends up filling the stack. The fact that the stack trace simply shows 'external code' also supports this proposition.
There's even a possibility that it can't always be reproduced and that it depends on the layout and size of the code generated for the try/finally. It looks like an errant jmp, or something similar, is being generated to the wrong location, which eventually repeats one or more push instructions. The idea that this is actually being caused by a box operation is, frankly, fascinating!
Final Final Update
If you look at the MS Connect bug that @Hasty G found (answer further down) - you see there that the bug manifests in a similar fashion, but with a volatile bool in a catch statement.
Also - MS queued a fix for this after getting it to repro - but no hotfix available yet after 7 months. I've gone on record before as being in support of MS Connect, so I'll say no more - I don't think I need to!
Final Final Final Update (23/02/2011)
It is fixed - but not yet released. Quote from the MS Team on the MS Connect bug:
The bug is in your code. Presumably, MyButton_Click_Aux() causes some method to be re-entered. However, you've inexplicably omitted that code from your question, so no one can comment on it.
Does Log call back into Log? That would also cause a stack overflow.
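One way to check for this kind of re-entry is a depth counter that fails fast with a readable stack trace long before the real stack runs out (a StackOverflowException itself cannot be caught in .NET 2.0 and later). This is a minimal sketch: the class name, the MaxDepth threshold, the SimulateReentry flag and the WriteEntry helper are all hypothetical and stand in for the logging code the question left out.

```csharp
using System;

public static class ReentryGuardSketch
{
    private const int MaxDepth = 50;      // arbitrary threshold for the sketch

    [ThreadStatic]
    private static int depth;             // per-thread nesting depth of Log

    public static bool SimulateReentry;   // set to true to see the guard fire

    public static void Log(object logData)
    {
        depth++;
        try
        {
            if (depth > MaxDepth)
            {
                throw new InvalidOperationException(
                    "Log re-entered " + depth + " times:\n" + Environment.StackTrace);
            }
            WriteEntry(logData);
        }
        finally
        {
            depth--;                       // always unwind the counter
        }
    }

    // Stand-in for real logging code; calls back into Log only when asked to.
    private static void WriteEntry(object logData)
    {
        if (SimulateReentry)
        {
            Log(logData);
        }
    }
}
```

If the guard never fires in the real application, recursion through Log can be ruled out and attention can move to the generated code, as the first answer does.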
When that exception happens, why not check what was recorded in the Call Stack panel? The call stack itself can tell you a lot.
Besides, low-level debugging using SOS.dll and WinDbg can also tell you a lot.