How can I debug a difficult-to-reproduce crash with no useful call stack?

Posted 2024-10-12 15:23:07

I am encountering an odd crash in our software and I'm having a lot of trouble debugging it, and so I am seeking SO's advice on how to tackle it.

The crash is an access violation reading a NULL pointer:

First chance exception at $00CF0041.
Exception class $C0000005 with message
'access violation at 0x00cf0041: read
of address 0x00000000'.

It only happens 'sometimes' - I haven't managed to figure out any rhyme or reason, yet, for when - and only in the main thread. When it occurs, the call stack contains one incorrect entry:

Call stack with one line, Classes::TList::Get, address 0x00cf0041

For the main thread, which this is, it should show a large stack full of other items.

At this point, all other threads are inactive (mostly sitting in WaitForSingleObject or a similar function.) I have only seen this crash occur in the main thread. It always has the same call stack of one entry, in the same method at the same address. This method may or may not be related - we do use the VCL in our application. My bet, though, is that something (possibly quite a while ago) is corrupting the stack, and the address where it's crashing is effectively random. Note it has been the same address across several builds, though - it's probably not truly random.

Here is what I've tried:

- Trying to reproduce it reliably at a certain point. I have found nothing that reproduces it every time, and a couple of things that occasionally do, or do not, for no apparent reason. These are not 'narrow' enough actions to pin it down to a particular section of code. It may be timing related, but at the point the IDE breaks in, other threads are usually doing nothing. I can't rule out a threading problem, but think it's unlikely.
- Building with extra debugging statements (extra debug info, extra asserts, etc.). After doing so, the crash never occurs.
- Building with Codeguard enabled. After doing so, the crash never occurs and Codeguard shows no errors.

My questions:

1. How do I find what code caused the crash? How do I do the equivalent of walking back up the stack?

2. What general advice do you have for how to trace the cause of this crash?

I am using Embarcadero RAD Studio 2010 (the project mostly contains C++ Builder code and small amounts of Delphi.)

Edit: I thought I should add what actually caused this. There was a thread that called ReadDirectoryChangesW and then, using GetOverlappedResult, waited on an event to continue and do something with the changes. The event was also signalled in order to terminate the thread after setting a status flag. The problem was that when the thread exited it never called CancelIo. As a result, Windows was still tracking changes and probably still writing to the buffer when the directory changed, even though the buffer, overlapped structure and event no longer existed (nor did the thread context in which they were created). Once CancelIo was called, there were no more crashes.
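The ordering that fixes this kind of bug can be sketched in portable C++: cancel the outstanding request, wait until it has really finished, and only then let the buffer go out of scope. This is an illustration under assumptions, not the code from the project. `PendingIoGuard` and `runWatcherOnce` are hypothetical names, and the two callbacks stand in for the Windows calls named in the comments.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not a Windows or VCL type: it guarantees that an
// outstanding asynchronous request is cancelled and drained before the
// memory the request writes into goes away.
class PendingIoGuard {
public:
    PendingIoGuard(std::function<void()> cancel, std::function<void()> drain)
        : cancel_(std::move(cancel)), drain_(std::move(drain)) {
        assert(cancel_ && drain_);  // both steps are required
    }
    PendingIoGuard(const PendingIoGuard&) = delete;
    PendingIoGuard& operator=(const PendingIoGuard&) = delete;

    ~PendingIoGuard() {
        cancel_();  // on Windows: CancelIo(dirHandle), called from the issuing thread
        drain_();   // on Windows: GetOverlappedResult(dirHandle, &ov, &bytes, TRUE)
    }

private:
    std::function<void()> cancel_;
    std::function<void()> drain_;
};

// Simulated watcher thread body. It returns the order of events so that the
// ordering can be inspected.
std::vector<std::string> runWatcherOnce() {
    std::vector<std::string> events;
    {
        // Declared first, therefore destroyed last. Stands in for the
        // notification buffer, the OVERLAPPED structure and the event.
        struct Buffer {
            std::vector<std::string>& log;
            ~Buffer() { log.push_back("buffer freed"); }
        } buffer{events};

        // Declared second, therefore destroyed first.
        PendingIoGuard guard(
            [&events] { events.push_back("io cancelled"); },
            [&events] { events.push_back("io drained"); });

        events.push_back("thread told to exit");
    }
    return events;
}
```

Because C++ destroys locals in reverse order of declaration, the guard always runs before the buffer is released, whichever path the thread takes out of the scope. In a real watcher the second callback matters as much as the first: a cancelled request may still complete a moment later, so the buffer must stay alive until the wait returns.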

夜光 2024-10-19 15:23:07

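Even if the stack trace the IDE gives you is incomplete, that doesn't mean there is no useful information left on the stack. Open the CPU view and look at the stack pane; every CALL opcode pushes a return address onto the stack. Since the stack grows downwards, you will find these return addresses above the current stack location, i.e. by scrolling upwards in the stack pane.

The main thread's stack will be somewhere around $00120000 or $00180000 (address space randomization in Vista and later makes this less predictable). Code for the main executable will be somewhere around $00400000. You can speculatively investigate elements on the stack that don't look like integer data (low values) or stack addresses ($00120000+ range) by right-clicking the stack entry and selecting Follow -> Near Code, which makes the disassembly pane jump to that code address. If it looks like invalid code, it is probably not a valid entry in the stack trace. If it is valid code, it may be OS code (frequently around $77000000 and above), in which case you won't have meaningful symbols, but every so often you will hit an actual, proper stack entry.

This technique is somewhat laborious, but it can give you meaningful stack trace information when the debugger is unable to trace things. It won't help if ESP (the stack pointer) itself has been corrupted, though. Fortunately, that is pretty rare.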
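The manual triage described above amounts to sorting each stack word into a bucket. Here is a minimal C++ sketch of that heuristic, assuming a 32-bit process. `Layout`, `classify` and `candidateReturnAddresses` are hypothetical names, and the address ranges are inputs you would read from the actual process, since they vary between runs and systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class WordKind { SmallInteger, StackAddress, ModuleCode, SystemCode, Unknown };

struct Range {  // half-open interval [begin, end)
    std::uint32_t begin;
    std::uint32_t end;
    bool contains(std::uint32_t a) const { return a >= begin && a < end; }
};

struct Layout {  // every value here is an assumption supplied by the caller
    Range stack;               // where the thread's stack lives
    Range module;              // where the main executable's code lives
    std::uint32_t systemBase;  // addresses at or above this are treated as OS code
};

// Sorts one 32-bit stack word into the buckets used in the manual procedure.
WordKind classify(std::uint32_t word, const Layout& layout) {
    assert(layout.stack.begin < layout.stack.end);
    assert(layout.module.begin < layout.module.end);
    if (word < 0x10000u) return WordKind::SmallInteger;  // low values: plain data
    if (layout.stack.contains(word)) return WordKind::StackAddress;
    if (layout.module.contains(word)) return WordKind::ModuleCode;
    if (word >= layout.systemBase) return WordKind::SystemCode;
    return WordKind::Unknown;
}

// Returns the indices of the words worth following in the disassembler.
std::vector<std::size_t> candidateReturnAddresses(
        const std::vector<std::uint32_t>& stackWords, const Layout& layout) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < stackWords.size(); ++i) {
        if (classify(stackWords[i], layout) == WordKind::ModuleCode) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

A hit is only a candidate. Each one still has to be checked in the disassembly, for instance by confirming that the instruction just before it is a CALL.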

凑诗 2024-10-19 15:23:07

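That's why I made Process Stack Viewer :-)
http://code.google.com/p/asmprofiler/wiki/ProcessStackViewer

It can display the stack using a raw stack trace, so it shows the full stack even when a normal stack trace isn't possible.
But be aware: raw stack tracing will show "false positives"! Any address on the stack for which a function name can be found will be listed.

It has helped me a number of times when I had the same problem as you (Delphi could not produce a normal stack trace because of an invalid stack state).

Edit: I have uploaded a new version; the one on the website is old (I use the new version a lot myself).
http://asmprofiler.googlecode.com/files/AsmProfiler_Sampling%20v1.0.7.13.zip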
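To illustrate why a raw stack trace produces false positives, here is a rough C++ sketch of the idea: every word on the stack is looked up in a symbol table, and anything that lands inside a known function is reported, whether or not it is a live return address. This is not how Process Stack Viewer is implemented. `Symbol`, `resolve` and `rawStackTrace` are hypothetical names, and the symbol table is assumed to be sorted by start address.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Symbol {
    std::uint32_t start;  // first address of the function
    std::uint32_t size;   // length of the function in bytes
    std::string name;
};

// Finds the function containing `address`, or nullptr if there is none.
// `symbols` must be sorted by ascending start address.
const Symbol* resolve(const std::vector<Symbol>& symbols, std::uint32_t address) {
    auto it = std::upper_bound(
        symbols.begin(), symbols.end(), address,
        [](std::uint32_t a, const Symbol& s) { return a < s.start; });
    if (it == symbols.begin()) return nullptr;  // below the first symbol
    --it;                                       // last symbol starting at or before address
    return (address - it->start < it->size) ? &*it : nullptr;
}

// Lists a name for every stack word that falls inside a known function.
// Stale values left over from earlier calls resolve too: the false positives.
std::vector<std::string> rawStackTrace(const std::vector<std::uint32_t>& stackWords,
                                       const std::vector<Symbol>& symbols) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    std::vector<std::string> names;
    for (std::uint32_t word : stackWords) {
        if (const Symbol* s = resolve(symbols, word)) names.push_back(s->name);
    }
    return names;
}
```

Nothing in the lookup can tell a real return address from a leftover value that happens to point into code, which is why the output has to be read critically.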

回心转意 2024-10-19 15:23:07

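Threading may well be the cause here. The usual suspects are threads that use overlapped structures on the stack, and threads that send pointers to objects on their own stack to other threads.

If you use Debugging Tools for Windows, the "dps" command dumps memory as pointer-sized values and resolves each one to the nearest symbol, which lets you read return addresses straight off the stack.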

呆° 2024-10-19 15:23:07

I'm not 100% sure, but from the image you provided I believe that somewhere during execution you're trying to access an object in a TList that is NULL, i.e.:

AList[Index].SomeProperty/SomeMethod/etc. <-- error if (AList[Index] == NULL)
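A defensive version of that access could look like the C++ sketch below. It uses `std::vector` in place of the VCL `TList` so that it stands alone, and `Item`, `itemAt` and `nameOrDefault` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Item {
    std::string name;
};

// Returns the element at `index`, or nullptr when the index is out of range.
// A slot that holds a null pointer is returned as nullptr as well.
const Item* itemAt(const std::vector<Item*>& list, std::size_t index) {
    if (index >= list.size()) return nullptr;
    return list[index];
}

// Dereferences the element only after both checks have passed.
std::string nameOrDefault(const std::vector<Item*>& list, std::size_t index,
                          const std::string& fallback) {
    const Item* item = itemAt(list, index);
    return item ? item->name : fallback;
}
```

A check like this avoids the crash at the point of use, but it does not explain how a null entry got into the list, so logging the failing index is usually worth adding.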

Debugging and finding the actual place where the exception is raised is never an easy task, especially when there is not much information or the problem is hard to reproduce. In this case I usually:

- go step by step from the main form's execution (if there is no exception before that point)
- while stepping, wrap any unsafe code I find in try...except and add checks on indexes (for arrays, lists, expected values being passed, etc.)
- if the above fails to find the issue, check whether some libraries are failing
- use EurekaLog; it sometimes fails as well (very rarely), but it usually points you in the right direction

I have had numerous issues similar to yours, and I can tell you that the issue was almost always extremely easy to fix; however, when the error popped up, I was not pointed anywhere near the actual cause.
