How do I track down the cause of a StackOverflowException in .NET?

Posted 2024-10-15 18:11:56 · 587 words · 5 views · 0 comments

I get a StackOverflowException when I run the following code:

private void MyButton_Click(object sender, EventArgs e) {
  MyButton_Click_Aux();
}

private static volatile int reportCount;

private static void MyButton_Click_Aux() {
  try { /*remove because stack overflows without*/ }
  finally {
    var myLogData = new ArrayList();
    myLogData.Add(reportCount);
    myLogData.Add("method MyButtonClickAux");
    Log(myLogData);
  }
}

private static void Log(object logData) {
  // my logging code is not relevant here
}

What could be causing the StackOverflowException?

Comments (4)

旧瑾黎汐 2024-10-22 18:11:56

I know how to stop it from happening

I just don't know why it causes it (yet). And it would appear you have indeed found a bug either in the .Net BCL or, more likely, in the JIT.

I just commented out all the lines in the MyButton_Click_Aux method and then started bringing them back in, one by one.

Take off the volatile from the static int and you'll no longer get a StackOverflowException.

Now to research why... Clearly something to do with Memory Barriers is causing an issue - perhaps somehow forcing the MyButton_Click_Aux method to call itself...

UPDATE

Okay so other people are finding that .Net 3.5 is not an issue.

I'm using .NET 4 as well, so these comments relate to that:

As I said, take the volatile off and it works.

Equally, if you put the volatile back on and remove the try/finally it also works:

private static void MyButton_Click_Aux()
{
  //try { /*remove because stack overflows without*/ }
  //finally
  //{
    var myLogData = new ArrayList(); 
    myLogData.Add(reportCount); 
    //myLogData.Add("method MyButtonClickAux");
    //Log(myLogData);
  //}
}  

I also wondered if it was something to do with the uninitialised reportCount when the try/finally is in. But it makes no difference if you initialise it to zero.

I'm looking at the IL now - although it might require someone with some ASM chops to get involved...

Final Update
As I say, this really is going to require analysis of the JIT output to really understand what's happening and whilst I find it fun to analyse assembler - I feel it's probably a job for someone in Microsoft so this bug can actually be confirmed and fixed! That said - it appears to be a pretty narrow set of circumstances.

I've moved over to a release build to get rid of all the IL noise (nops etc) for analysis.

This has, however, had a complicating impact on the diagnosis. I thought I had it but didn't - but now I know what it is.

I tried this code:

private static void MyButton_Click_Aux()
{
  try { }
  finally
  {
    var myLogData = new ArrayList();
    Console.WriteLine(reportCount);
    //myLogData.Add("method MyButtonClickAux");
    //Log(myLogData);
  }
}

With the int as volatile. It runs without fault. Here's the IL:

.maxstack 1
L_0000: leave.s L_0015
L_0002: newobj instance void [mscorlib]System.Collections.ArrayList::.ctor()
L_0007: pop 
L_0008: volatile. 
L_000a: ldsfld int32 modreq([mscorlib]System.Runtime.CompilerServices.IsVolatile) WindowsFormsApplication1.Form1::reportCount
L_000f: call void [mscorlib]System.Console::WriteLine(int32)
L_0014: endfinally 
L_0015: ret 
.try L_0000 to L_0002 finally handler L_0002 to L_0015

Then we look at the minimum code required to get the error again:

private static void MyButton_Click_Aux()
{
  try { }
  finally
  {
    var myLogData = new ArrayList();
    myLogData.Add(reportCount);
  }
}

And its IL:

.maxstack 2
.locals init (
    [0] class [mscorlib]System.Collections.ArrayList myLogData)
L_0000: leave.s L_001c
L_0002: newobj instance void [mscorlib]System.Collections.ArrayList::.ctor()
L_0007: stloc.0 
L_0008: ldloc.0 
L_0009: volatile. 
L_000b: ldsfld int32 modreq([mscorlib]System.Runtime.CompilerServices.IsVolatile) WindowsFormsApplication1.Form1::reportCount
L_0010: box int32
L_0015: callvirt instance int32 [mscorlib]System.Collections.ArrayList::Add(object)
L_001a: pop 
L_001b: endfinally 
L_001c: ret 
.try L_0000 to L_0002 finally handler L_0002 to L_001c

The difference? Well, there are two that I spotted - boxing of the volatile int, and a virtual call. So I set up these two classes:

public class DoesNothingBase
{
  public void NonVirtualFooBox(object arg) { }
  public void NonVirtualFooNonBox(int arg) { }

  public virtual void FooBox(object arg) { }
  public virtual void FooNonBox(int arg) { }
}

public class DoesNothing : DoesNothingBase
{
  public override void FooBox(object arg) { }
  public override void FooNonBox(int arg) { }
}

And then tried each of these four versions of the offending method:

try { }
finally
{
  var doesNothing = new DoesNothing();
  doesNothing.FooNonBox(reportCount);
}

Which works.

try { }
finally
{
  var doesNothing = new DoesNothing();
  doesNothing.NonVirtualFooNonBox(reportCount);
}

Which also works.

try { }
finally
{
  var doesNothing = new DoesNothing();
  doesNothing.FooBox(reportCount);
}

Oops - StackOverflowException.

And:

try { }
finally
{
  var doesNothing = new DoesNothing();
  doesNothing.NonVirtualFooBox(reportCount);
}

Oops again! StackOverflowException!
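For anyone who wants to rerun the four variants side by side, here is a minimal console sketch. It reuses the DoesNothingBase/DoesNothing classes from above; the FinallyBoxingRepro class, the Run helper and the variant method names are my own additions for illustration. The assumption is that on an affected .NET 4.0 runtime the process dies with a StackOverflowException in one of the boxing variants, while on a runtime with the fix all four should complete. Calling the variants through a delegate may also change what the JIT generates, so treat this as a starting point rather than a guaranteed repro.

```csharp
using System;

public class DoesNothingBase
{
    public void NonVirtualFooBox(object arg) { }
    public void NonVirtualFooNonBox(int arg) { }

    public virtual void FooBox(object arg) { }
    public virtual void FooNonBox(int arg) { }
}

public class DoesNothing : DoesNothingBase
{
    public override void FooBox(object arg) { }
    public override void FooNonBox(int arg) { }
}

public static class FinallyBoxingRepro
{
    private static volatile int reportCount;

    // Virtual call, int passed without boxing (reported to work).
    private static void VirtualNonBox()
    {
        try { }
        finally { new DoesNothing().FooNonBox(reportCount); }
    }

    // Non-virtual call, int passed without boxing (reported to work).
    private static void NonVirtualNonBox()
    {
        try { }
        finally { new DoesNothing().NonVirtualFooNonBox(reportCount); }
    }

    // Virtual call, volatile int boxed to object (reported to overflow on .NET 4.0).
    private static void VirtualBox()
    {
        try { }
        finally { new DoesNothing().FooBox(reportCount); }
    }

    // Non-virtual call, volatile int boxed to object (reported to overflow on .NET 4.0).
    private static void NonVirtualBox()
    {
        try { }
        finally { new DoesNothing().NonVirtualFooBox(reportCount); }
    }

    // Prints the name before the call: a StackOverflowException cannot be
    // caught, so the last "Running" line identifies the failing variant.
    private static void Run(string name, Action variant)
    {
        Console.WriteLine("Running: " + name);
        variant();
        Console.WriteLine("Completed: " + name);
    }

    public static void Main()
    {
        Run("virtual, no boxing", VirtualNonBox);
        Run("non-virtual, no boxing", NonVirtualNonBox);
        Run("virtual, boxing", VirtualBox);
        Run("non-virtual, boxing", NonVirtualBox);
    }
}
```

If a variant overflows, the console output stops at its "Running" line with no matching "Completed" line.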

We could go further with this - but the issue is, I feel, clearly caused by the boxing of the volatile int whilst inside the finally block of a try/finally... I put the code inside the try, and no problem. I added a catch clause (and put the code in there), also no problem.

It could also apply to the boxing of other value types I guess.

So, to summarise - in .Net 4.0 - in both debug and release builds - the boxing of a volatile int in a finally block appears to cause the JIT to generate code that ends up filling the stack. The fact that the stack trace simply shows 'external code' also supports this proposition.

There's even a possibility that it can't always be reproduced and might even depend on the layout and size of the code that is generated by the try/finally. It's clearly something to do with an errant jmp or something similar being generated to the wrong location which eventually repeats one or more push commands to the stack. The idea that that is being caused actually by a box operation is, frankly, fascinating!
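Given that dropping volatile from the field made the overflow go away, one possible way to sidestep the problem while keeping a volatile read is to declare the field as a plain int and ask for volatile semantics explicitly at the read site. The sketch below is an assumption on my part, not something I have verified against an affected .NET 4.0 JIT; the VolatileWorkaround class and BuildLogData method are made-up names for illustration.

```csharp
using System;
using System.Collections;
using System.Threading;

public static class VolatileWorkaround
{
    // Plain (non-volatile) field: the volatile modifier was the trigger in
    // the experiments above, so it is dropped here.
    private static int reportCount;

    public static ArrayList BuildLogData()
    {
        var myLogData = new ArrayList();
        try { }
        finally
        {
            // Thread.VolatileRead exists in .NET 4.0; on 4.5 and later
            // Volatile.Read(ref reportCount) is the usual replacement.
            int snapshot = Thread.VolatileRead(ref reportCount);

            // The boxed value is now an ordinary local, not a volatile field.
            myLogData.Add(snapshot);
            myLogData.Add("method MyButtonClickAux");
        }
        return myLogData;
    }

    public static void Main()
    {
        Interlocked.Increment(ref reportCount);
        ArrayList data = BuildLogData();
        Console.WriteLine(data[0] + ", " + data[1]);
    }
}
```

Writers have to be adjusted to match (Interlocked or Thread.VolatileWrite), since a plain field no longer gets volatile semantics from the compiler.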

Final Final Update

If you look at the MS Connect bug that @Hasty G found (answer further down) - you see there that the bug manifests in a similar fashion, but with a volatile bool in a catch statement.

Also - MS queued a fix for this after getting it to repro - but no hotfix available yet after 7 months. I've gone on record before as being in support of MS Connect, so I'll say no more - I don't think I need to!

Final Final Final Update (23/02/2011)

It is fixed - but not yet released. Quote from the MS Team on the MS Connect bug:

Yes, it's fixed. We're in the process of figuring out how best to ship a fix. It is already fixed in 4.5, but we'd really like to fix a batch of code generation bugs prior to 4.5 release.

灯下孤影 2024-10-22 18:11:56

The bug is in your code. Presumably, MyButton_Click_Aux() causes some method to be re-entered. However, you've inexplicably omitted that code from your question and so no one can comment on it.

ぇ气 2024-10-22 18:11:56

Does Log call back into Log? That would also cause a stack overflow.
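A quick way to check for that kind of re-entry is a per-thread depth counter that dumps the stack as soon as Log is entered a second time. This is only a sketch: the body of the asker's Log is not shown, so WriteEntry below is a hypothetical sink that logs its own activity, and ReentrancyCheck and MaxObservedDepth are names made up for the example.

```csharp
using System;

public static class ReentrancyCheck
{
    // One counter per thread, so concurrent callers do not trip the guard.
    [ThreadStatic]
    private static int logDepth;

    // Deepest nesting seen so far; exposed only to make the sketch checkable.
    public static int MaxObservedDepth;

    public static void Log(object logData)
    {
        logDepth++;
        try
        {
            if (logDepth > MaxObservedDepth)
            {
                MaxObservedDepth = logDepth;
            }

            if (logDepth > 1)
            {
                // Re-entry detected: report who called us and bail out
                // instead of recursing until the stack is exhausted.
                Console.WriteLine("Log re-entered at depth " + logDepth);
                Console.WriteLine(Environment.StackTrace);
                return;
            }

            WriteEntry(logData);
        }
        finally
        {
            logDepth--;
        }
    }

    // Hypothetical sink that logs its own activity, which re-enters Log.
    private static void WriteEntry(object logData)
    {
        Console.WriteLine("entry: " + logData);
        Log("wrote an entry");
    }

    public static void Main()
    {
        Log("method MyButtonClickAux");
        Console.WriteLine("max depth: " + MaxObservedDepth);
    }
}
```

If the guard never fires in the real application, recursion through Log can be ruled out and attention can move to the runtime, as in the first answer.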

荒人说梦 2024-10-22 18:11:56

When that exception happens, why not check what was recorded in the Call Stack panel? The call stack itself can tell you a lot.

Besides, low level debugging using SOS.dll and WinDbg can also tell you a lot.
