Debugging/bypassing a BSOD without source code

Posted 2024-12-07 06:33:37

Hello and good day to you.

I need a bit of assistance here:

Situation:
I have an obscure DirectX 9 application (its name and details are irrelevant to the question) that causes a blue screen of death on all NVIDIA cards (GeForce 8400GS and up) since a certain driver version. I believe the problem is indirectly caused by a DirectX 9 call or a flag that triggers a driver bug.

Goal:
I'd like to track down the offending flag or function call (for fun; this isn't my job or homework) and bypass the error condition by writing a proxy DLL. I already have a finished proxy DLL that provides wrappers for IDirect3D9, IDirect3DDevice9, IDirect3DVertexBuffer9 and IDirect3DIndexBuffer9 and does basic logging/tracing of Direct3D calls. However, I can't pinpoint the function that causes the crash.

Problems:

  1. No source code or technical support is available. There will be no assistance, and nobody else will fix the problem.
  2. The memory dump produced by the kernel wasn't helpful. Apparently an access violation happens within nv4_disp.dll, but I can't use the stack trace to get back to an IDirect3DDevice9 method call, and there's a chance that the bug happens asynchronously.
  3. (Main problem) Because of the large number of IDirect3DDevice9 method calls, I can't reliably log them to a file or over the network:
    1. Logging to a file causes significant slowdown even without flushing, and without flushing the last contents of the log are lost when the system BSODs.
    2. Logging over the network (using UDP and Winsock's sendto) also causes significant slowdown and must not be done asynchronously (asynchronous packets are lost on BSOD). In addition, the packets around the crash are sometimes lost even when sent synchronously.
    3. When the application is slowed down by the logging routines, the BSOD is less likely to happen, which makes tracking it down harder.

Question:
I normally don't write drivers or do this level of debugging, so I have the impression that I'm missing something important and that there's a more trivial way to track down the problem than writing an IDirect3DDevice9 proxy DLL with a custom logging mechanism. What is it? What is the standard way of diagnosing, handling, or fixing a problem like this (no source code, a COM interface method triggers a BSOD)?

Minidump analysis (WinDBG):

Loading User Symbols
Loading unloaded module list
...........
Unable to load image nv4_disp.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nv4_disp.dll
*** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 1000008E, {c0000005, bd0a2fd0, b0562b40, 0}

Probably caused by : nv4_disp.dll ( nv4_disp+90fd0 )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e)
This is a very common bugcheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003.  This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG.  This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG.  This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: bd0a2fd0, The address that the exception occurred at
Arg3: b0562b40, Trap Frame
Arg4: 00000000

Debugging Details:
------------------


EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

FAULTING_IP: 
nv4_disp+90fd0
bd0a2fd0 39b8f8000000    cmp     dword ptr [eax+0F8h],edi

TRAP_FRAME:  b0562b40 -- (.trap 0xffffffffb0562b40)
ErrCode = 00000000
eax=00000808 ebx=e37f8200 ecx=e4ae1c68 edx=e37f8328 esi=e37f8400 edi=00000000
eip=bd0a2fd0 esp=b0562bb4 ebp=e37e09c0 iopl=0         nv up ei pl nz na po nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010202
nv4_disp+0x90fd0:
bd0a2fd0 39b8f8000000    cmp     dword ptr [eax+0F8h],edi ds:0023:00000900=????????
Resetting default scope

CUSTOMER_CRASH_COUNT:  3

DEFAULT_BUCKET_ID:  DRIVER_FAULT

BUGCHECK_STR:  0x8E

LAST_CONTROL_TRANSFER:  from bd0a2e33 to bd0a2fd0

STACK_TEXT:  
WARNING: Stack unwind information not available. Following frames may be wrong.
b0562bc4 bd0a2e33 e37f8200 e37f8200 e4ae1c68 nv4_disp+0x90fd0
b0562c3c bf8edd6b b0562cfc e2601714 e4ae1c58 nv4_disp+0x90e33
b0562c74 bd009530 b0562cfc bf8ede06 e2601714 win32k!WatchdogDdDestroySurface+0x38
b0562d30 bd00b3a4 e2601008 e4ae1c58 b0562d50 dxg!vDdDisableSurfaceObject+0x294
b0562d54 8054161c e2601008 00000001 0012c518 dxg!DxDdDestroySurface+0x42
b0562d54 7c90e4f4 e2601008 00000001 0012c518 nt!KiFastCallEntry+0xfc
0012c518 00000000 00000000 00000000 00000000 0x7c90e4f4


STACK_COMMAND:  kb

FOLLOWUP_IP: 
nv4_disp+90fd0
bd0a2fd0 39b8f8000000    cmp     dword ptr [eax+0F8h],edi

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  nv4_disp+90fd0

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nv4_disp

IMAGE_NAME:  nv4_disp.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4e390d56

FAILURE_BUCKET_ID:  0x8E_nv4_disp+90fd0

BUCKET_ID:  0x8E_nv4_disp+90fd0

Followup: MachineOwner

Comments (3)

抱猫软卧 2024-12-14 06:33:38
nv4_disp+90fd0
bd0a2fd0 39b8f8000000    cmp     dword ptr [eax+0F8h],edi

This is the important part. Looking at it, eax is most probably invalid, so the instruction attempts to access an invalid memory address.
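The trap frame in the question is consistent with this reading. As a small worked check, using only the register value and displacement shown in the dump (the function and constant names are made up for illustration):

```cpp
#include <cstdint>

// Values copied from the trap frame in the question's dump.
constexpr std::uint32_t kEax = 0x00000808;     // eax at the time of the fault
constexpr std::uint32_t kDisplacement = 0xF8;  // from "[eax+0F8h]"

// Effective address read by: cmp dword ptr [eax+0F8h], edi
constexpr std::uint32_t effective_address(std::uint32_t base, std::uint32_t disp) {
    return base + disp;
}

// Matches the "ds:0023:00000900=????????" that WinDBG printed for the operand.
static_assert(effective_address(kEax, kDisplacement) == 0x00000900,
              "effective address should match the dump");
```

The result, 0x900, is a very low address, whereas the other registers hold values such as e37f8200 that look like kernel pointers. That suggests eax contains a small integer or stale data rather than a pointer to the structure the driver expected, although the dump alone cannot confirm why.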

What you need to do is load nv4_disp.dll into IDA (you can get a free version), check the image base that IDA loads nv4_disp at, and hit 'g' to go to an address. Add 90fd0 to the image base IDA is using, and it should take you directly to the offending instruction (depending on section structure).
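The address arithmetic can be sketched as follows. The fault address and offset come from the dump in the question. The IDA image base used here is a placeholder assumption, and the helper names are hypothetical; read the real base from your own IDA database.

```cpp
#include <cstdint>

// From the dump: "nv4_disp+90fd0" at address bd0a2fd0.
constexpr std::uint32_t kFaultAddress = 0xbd0a2fd0;
constexpr std::uint32_t kFaultOffset  = 0x00090fd0;

// Base address the driver was loaded at when the system crashed.
constexpr std::uint32_t runtime_base(std::uint32_t fault_address, std::uint32_t offset) {
    return fault_address - offset;
}

// Address to jump to in the disassembler for a given image base.
constexpr std::uint32_t address_in_disassembler(std::uint32_t image_base, std::uint32_t offset) {
    return image_base + offset;
}

static_assert(runtime_base(kFaultAddress, kFaultOffset) == 0xbd012000,
              "load base implied by the dump");

// ASSUMPTION: placeholder image base, not taken from the real nv4_disp.dll.
constexpr std::uint32_t kAssumedIdaBase = 0x00010000;
static_assert(address_in_disassembler(kAssumedIdaBase, kFaultOffset) == 0x000a0fd0,
              "address to enter after pressing 'g'");
```

With the placeholder base, the address to enter would be 0xa0fd0. The instruction found there should be the same cmp dword ptr [eax+0F8h],edi if the DLL on disk is the same build as the one that crashed.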

From here you can analyze the control flow and how eax is set and used. If you have a good kernel-level debugger, you can set a breakpoint on this address and try to get it to hit.

When analysing the function, try to figure out what it does, what eax is meant to point to at that point, what it's actually pointing to, and why. This is the hard part, and it accounts for much of the difficulty and skill of reverse engineering.

暮凉 2024-12-14 06:33:38

Found a solution.

Problem:
Logging is unreliable: messages written to a file disappear during the BSOD, packets are sometimes lost when logging over the network, and logging slows the application down.

Solution:
Instead of logging to a file or over the network, configure the system to produce a full physical memory dump on BSOD and log all messages into a memory buffer. It'll be faster. Once the system crashes, it'll dump the entire memory into a file, and you can either view the contents of the log buffer using WinDBG's dt command (if you have debug symbols) or search for and locate the log stored in memory using the "Memory" view.

I used a circular buffer of std::string to store messages and a separate array of const char* to make things easier to read in WinDBG, but you could simply create a huge array of char and store all messages in it as plain text.
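A minimal sketch of the simpler "huge array of char" variant might look like this. It is not the code I actually used. Every name in it (CrashLog, g_crash_log, log_message, the marker text) is hypothetical, and the capacity is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>

// Arbitrary capacity for this sketch; size it to hold enough recent calls.
constexpr std::size_t kLogCapacity = 64 * 1024;

// Fixed-size in-memory log that survives into a complete memory dump.
struct CrashLog {
    char marker[16];          // recognizable signature to search for in the dump
    std::size_t head;         // next write position, wraps at kLogCapacity
    char text[kLogCapacity];  // plain-text messages separated by '\n'
};

// Global, so it can be inspected by name when debug symbols are available.
CrashLog g_crash_log = {"D3D9LOG_BEGIN", 0, {0}};

// Appends one message plus a newline, overwriting the oldest data on wrap.
// Not thread-safe: a proxy DLL called from several threads would need a lock.
void log_message(const char* msg) {
    assert(msg != nullptr);
    std::size_t pos = g_crash_log.head;
    for (const char* p = msg; *p != '\0'; ++p) {
        g_crash_log.text[pos] = *p;
        pos = (pos + 1) % kLogCapacity;
    }
    g_crash_log.text[pos] = '\n';
    g_crash_log.head = (pos + 1) % kLogCapacity;
}
```

Each wrapper in the proxy DLL would call something like log_message("SetRenderState") before forwarding to the real method. Nothing touches the disk or the network, so the overhead per call is a short memory copy. The marker field is only there to give you something to search for (for example with WinDBG's s -a command) if symbols are unavailable. The most recent entry ends just before head.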

Details:
The entire process on Windows XP:

  1. Ensure that the minimum page file size is equal to or larger than the total amount of RAM + 1 megabyte. (Right-click "My Computer"->Properties->Advanced->Performance->Advanced->Change)
  2. Configure the system to produce a complete memory dump on BSOD (right-click "My Computer"->Properties->Advanced->Startup and Recovery->Settings->Write Debugging Information; select "Complete memory dump" and specify the path you want).
  3. Ensure that the disk where the file will be written has the required amount of free space (the total amount of RAM on your system).
  4. Build the app/DLL (the one that does the logging) with debug symbols, and trigger the BSOD.
  5. Wait until the memory dump is finished, then reboot. Feel free to swear at the driver developer while the system writes the memory dump and reboots.
  6. Copy the MEMORY.DMP the system produced to a safe place, so you won't lose everything if the system crashes again.
  7. Launch WinDBG.
  8. Open the memory dump (File->Open Crash Dump).
  9. If you want to see what happened, use the !analyze -v command.
  10. Access the memory buffer that stores the logged messages using one of these methods:
    1. To see the contents of a global variable, use dt module!variable, where "module" is the name of your library (without .dll) and "variable" is the name of the variable. You can use wildcards. You can also use an address instead of module!variable.
    2. To see the contents of one field of the global variable (if the global variable is a struct), use dt module!variable field, where "field" is a member of the variable.
    3. To see more details about the variable (contents of arrays and substructures), use dt -b module!variable field or dt -b module!variable.
    4. If you don't have symbols, you'll need to search for your "logfile" using the memory window.

At this point you'll be able to see the contents of the log that was stored in memory, plus you'll have a snapshot of the entire system at the moment it crashed.

Also...

  1. To see info about the process that crashed the system, use !process.
  2. To see loaded modules, use lm.
  3. For info about a thread, use !thread id, where id is the hexadecimal ID you saw in the !process output.

缱绻入梦 2024-12-14 06:33:38

It looks like the crash may be caused by either a bad pointer or heap corruption. You can tell this because the crash occurs in a memory-freeing function (DxDdDestroySurface). Destroying surfaces is something that you absolutely need to do. You can't just stub this out: the surface will still get freed when the program exits, and if you disable it inside the kernel, you'll run out of on-card memory very quickly and crash that way as well.

You can try to figure out what sequence of events leads up to this heap corruption, but there's no silver bullet here. As fileoffset suggested, you'll need to actually reverse engineer the driver to see why this happens (it may also help to compare the drivers before and after the offending driver version).
