无需源代码即可调试/绕过 BSOD
你好,祝你有美好的一天。
这里需要一些帮助:
情况:
我有一个不起眼的 DirectX 9 应用程序(名称和应用程序详细信息与问题无关),自某些驱动程序版本以来,它会导致所有 nvidia 卡(GeForce 8400GS 及更高版本)出现蓝屏死机。我认为该问题是由 DirectX 9 调用或触发驱动程序错误的标志间接引起的。
目标:
我想追踪有问题的标志/函数调用(为了好玩,这不是我的工作/家庭作业)并通过编写代理 dll 来绕过错误条件。我已经有一个完成的代理 dll,它为 IDirect3D9、IDirect3DDevice9、IDirect3DVertexBuffer9 和 IDirect3DIndexBuffer9 提供包装器,并提供 Direct3D 调用的基本日志记录/跟踪。但是,我无法查明导致崩溃的函数。
问题:
- 没有可用的源代码或技术支持。不会有任何帮助,也没有其他人可以解决问题。
- 内核生成的内存转储没有帮助 - 显然 nv4_disp.dll 内发生了访问冲突,但我无法使用堆栈跟踪转到 IDirect3DDevice9 方法调用,而且错误有可能异步发生。
- (主要问题)由于大量的 Direct3D9Device 方法调用,我无法可靠地将它们记录到文件中或通过网络:
- 即使没有刷新,登录文件也会导致速度显着减慢,因此,当系统出现 BSOD 时,日志的所有最后内容都会丢失。
- 通过网络进行日志记录(使用 UDP 和 WINSOck 的
sendto
)也会导致显着的速度减慢,并且不能异步完成(异步数据包在 BSOD 时丢失),而且数据包(崩溃周围的数据包)有时会丢失即使同步发送也会丢失。 - 当应用程序因日志记录例程而“减慢”速度时,发生 BSOD 的可能性较小,这使得跟踪它变得更加困难。
问题:
我通常不编写驱动程序,也不进行这种级别的调试,所以我觉得我错过了一些重要的东西,有一种比使用自定义日志记录机制编写 IDirect3DDevice9 代理 dll 更简单的方法来追踪问题。它是什么?诊断/处理/修复这样的问题的标准方法是什么(没有源代码,COM接口方法触发BSOD)?
小型转储分析(WinDBG):
Loading User Symbols Loading unloaded module list ........... Unable to load image nv4_disp.dll, Win32 error 0n2 *** WARNING: Unable to verify timestamp for nv4_disp.dll *** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* Use !analyze -v to get detailed debugging information. BugCheck 1000008E, {c0000005, bd0a2fd0, b0562b40, 0} Probably caused by : nv4_disp.dll ( nv4_disp+90fd0 ) Followup: MachineOwner --------- 0: kd> !analyze -v ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e) This is a very common bugcheck. Usually the exception address pinpoints the driver/function that caused the problem. Always note this address as well as the link date of the driver/image that contains this address. Some common problems are exception code 0x80000003. This means a hard coded breakpoint or assertion was hit, but this system was booted /NODEBUG. This is not supposed to happen as developers should never have hardcoded breakpoints in retail code, but ... If this happens, make sure a debugger gets connected, and the system is booted /DEBUG. This will let us see why this breakpoint is happening. Arguments: Arg1: c0000005, The exception code that was not handled Arg2: bd0a2fd0, The address that the exception occurred at Arg3: b0562b40, Trap Frame Arg4: 00000000 Debugging Details: ------------------ EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s". FAULTING_IP: nv4_disp+90fd0 bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi TRAP_FRAME: b0562b40 -- (.trap 0xffffffffb0562b40) ErrCode = 00000000 eax=00000808 ebx=e37f8200 ecx=e4ae1c68 edx=e37f8328 esi=e37f8400 edi=00000000 eip=bd0a2fd0 esp=b0562bb4 ebp=e37e09c0 iopl=0 nv up ei pl nz na po nc cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202 nv4_disp+0x90fd0: bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi ds:0023:00000900=???????? Resetting default scope CUSTOMER_CRASH_COUNT: 3 DEFAULT_BUCKET_ID: DRIVER_FAULT BUGCHECK_STR: 0x8E LAST_CONTROL_TRANSFER: from bd0a2e33 to bd0a2fd0 STACK_TEXT: WARNING: Stack unwind information not available. Following frames may be wrong. b0562bc4 bd0a2e33 e37f8200 e37f8200 e4ae1c68 nv4_disp+0x90fd0 b0562c3c bf8edd6b b0562cfc e2601714 e4ae1c58 nv4_disp+0x90e33 b0562c74 bd009530 b0562cfc bf8ede06 e2601714 win32k!WatchdogDdDestroySurface+0x38 b0562d30 bd00b3a4 e2601008 e4ae1c58 b0562d50 dxg!vDdDisableSurfaceObject+0x294 b0562d54 8054161c e2601008 00000001 0012c518 dxg!DxDdDestroySurface+0x42 b0562d54 7c90e4f4 e2601008 00000001 0012c518 nt!KiFastCallEntry+0xfc 0012c518 00000000 00000000 00000000 00000000 0x7c90e4f4 STACK_COMMAND: kb FOLLOWUP_IP: nv4_disp+90fd0 bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi SYMBOL_STACK_INDEX: 0 SYMBOL_NAME: nv4_disp+90fd0 FOLLOWUP_NAME: MachineOwner MODULE_NAME: nv4_disp IMAGE_NAME: nv4_disp.dll DEBUG_FLR_IMAGE_TIMESTAMP: 4e390d56 FAILURE_BUCKET_ID: 0x8E_nv4_disp+90fd0 BUCKET_ID: 0x8E_nv4_disp+90fd0 Followup: MachineOwner
Hello and good day to you.
Need a bit of assitance here:
Situation:
I have an obscure DirectX 9 application (name and application details are irrelevant to the question) that causes blue screen of death on all nvidia cards (GeForce 8400GS and up) since certain driver version. I believe that the problem is indirectly caused by DirectX 9 call or a flag that triggers driver bug.
Goal:
I'd like to track down offending flag/function call (for fun, this isn't my job/homework) and bypass error condition by writing proxy dll. I already have a finished proxy dll that provides wrappers for IDirect3D9, IDirect3DDevice9, IDirect3DVertexBuffer9 and IDirect3DIndexBuffer9 and provides basic logging/tracing of Direct3D calls. However, I can't pinpoint function which causes crash.
Problems:
- No source code or technical support is available. There will be no assitance, and nobody else will fix the problem.
- Memory dump produced by kernel wasn't helpful - apparently an access violation happens within nv4_disp.dll, but I can't use stacktrace to go to IDirect3DDevice9 method call, plus there's a chance that bug happens asynchronously.
- (Main problem) Because of large number of Direct3D9Device method calls, I can't reliably log them into file or over network:
- Logging into file causes significant slowdown even without flushing, and because of that all last contents of the log are lost when system BSODs.
- Logging over network (using UDP and WINSOck's
sendto
)also causes significant slowdown and must not be done asynchronously (asynchronous packets are lost on BSOD), plus packets (the ones around the crash) are sometimes lost even when sent synchronously. - When application is "slowed" down by logging routines, BSOD is less likely to happen, which makes tracking it down harder.
Question:
I normally don't write drivers, and don't do this level of debugging, so I have impression that I'm missing something important there's a more trivial way to track down the problem than writing IDirect3DDevice9 proxy dll with custom logging mechanism. What is it? What is the standard way of diagnosing/handling/fixing problem like this (no source code, COM interface method triggers BSOD)?
Minidump analysis(WinDBG):
Loading User Symbols Loading unloaded module list ........... Unable to load image nv4_disp.dll, Win32 error 0n2 *** WARNING: Unable to verify timestamp for nv4_disp.dll *** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* Use !analyze -v to get detailed debugging information. BugCheck 1000008E, {c0000005, bd0a2fd0, b0562b40, 0} Probably caused by : nv4_disp.dll ( nv4_disp+90fd0 ) Followup: MachineOwner --------- 0: kd> !analyze -v ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e) This is a very common bugcheck. Usually the exception address pinpoints the driver/function that caused the problem. Always note this address as well as the link date of the driver/image that contains this address. Some common problems are exception code 0x80000003. This means a hard coded breakpoint or assertion was hit, but this system was booted /NODEBUG. This is not supposed to happen as developers should never have hardcoded breakpoints in retail code, but ... If this happens, make sure a debugger gets connected, and the system is booted /DEBUG. This will let us see why this breakpoint is happening. Arguments: Arg1: c0000005, The exception code that was not handled Arg2: bd0a2fd0, The address that the exception occurred at Arg3: b0562b40, Trap Frame Arg4: 00000000 Debugging Details: ------------------ EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s". FAULTING_IP: nv4_disp+90fd0 bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi TRAP_FRAME: b0562b40 -- (.trap 0xffffffffb0562b40) ErrCode = 00000000 eax=00000808 ebx=e37f8200 ecx=e4ae1c68 edx=e37f8328 esi=e37f8400 edi=00000000 eip=bd0a2fd0 esp=b0562bb4 ebp=e37e09c0 iopl=0 nv up ei pl nz na po nc cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202 nv4_disp+0x90fd0: bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi ds:0023:00000900=???????? Resetting default scope CUSTOMER_CRASH_COUNT: 3 DEFAULT_BUCKET_ID: DRIVER_FAULT BUGCHECK_STR: 0x8E LAST_CONTROL_TRANSFER: from bd0a2e33 to bd0a2fd0 STACK_TEXT: WARNING: Stack unwind information not available. Following frames may be wrong. b0562bc4 bd0a2e33 e37f8200 e37f8200 e4ae1c68 nv4_disp+0x90fd0 b0562c3c bf8edd6b b0562cfc e2601714 e4ae1c58 nv4_disp+0x90e33 b0562c74 bd009530 b0562cfc bf8ede06 e2601714 win32k!WatchdogDdDestroySurface+0x38 b0562d30 bd00b3a4 e2601008 e4ae1c58 b0562d50 dxg!vDdDisableSurfaceObject+0x294 b0562d54 8054161c e2601008 00000001 0012c518 dxg!DxDdDestroySurface+0x42 b0562d54 7c90e4f4 e2601008 00000001 0012c518 nt!KiFastCallEntry+0xfc 0012c518 00000000 00000000 00000000 00000000 0x7c90e4f4 STACK_COMMAND: kb FOLLOWUP_IP: nv4_disp+90fd0 bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi SYMBOL_STACK_INDEX: 0 SYMBOL_NAME: nv4_disp+90fd0 FOLLOWUP_NAME: MachineOwner MODULE_NAME: nv4_disp IMAGE_NAME: nv4_disp.dll DEBUG_FLR_IMAGE_TIMESTAMP: 4e390d56 FAILURE_BUCKET_ID: 0x8E_nv4_disp+90fd0 BUCKET_ID: 0x8E_nv4_disp+90fd0 Followup: MachineOwner
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这是重要的部分。从这个角度来看,很可能 eax 无效,因此尝试访问无效的内存地址。
你需要做的是将nv4_disp.dll加载到IDA中(你可以获得免费版本),检查IDA加载nv4_disp的映像库并点击'g'转到地址,尝试将90fd0添加到IDA正在使用的映像库中,它应该直接将您带到有问题的指令(取决于章节结构)。
从这里您可以分析控制流,以及如何设置和使用 eax。如果您有一个好的内核级调试器,您可以在此地址上设置一个断点并尝试使其命中。
分析该函数时,您应该尝试找出该函数的作用、此时 eax 指向什么、它实际指向什么以及原因。这是最难的部分,也是逆向工程难度和技巧的很大一部分。
This is the important part. Looking at this, it is most probable that eax is invalid, hence attempting to access an invalid memory address.
What you need to do is load nv4_disp.dll into IDA (you can get a free version), check the image base that IDA loads nv4_disp at and hit 'g' to goto address, try adding 90fd0 to the image base IDA is using, and it should take you directly to the offending instruction (depending on section structure).
From here you can analyze the control flow, and how eax is set and used. If you have a good kernel level debugger you can set a breakpoint on this address and try and get it to hit.
Analysing the function, you should attempt to figure out what the function does, what eax is meant to be pointing to at that point, what its actually pointing to, and why. This is the hard part and is a great part of the difficulty and skill of reverse engineering.
找到了解决方案。
问题:
日志记录不可靠,因为消息(转储到文件时)在蓝屏死机期间消失,通过网络进行日志记录时数据包有时会丢失,并且由于日志记录而导致速度减慢。
解决方案:
配置系统以在 BSOD 上生成完整的物理内存转储并将所有消息记录到任何内存缓冲区中,而不是记录到文件或通过网络。会更快。一旦系统崩溃,它会将整个内存转储到文件中,并且可以使用 WinDBG 的 dt (如果有调试符号)命令查看日志文件缓冲区的内容,或者您'将能够使用“内存”视图搜索和定位存储在内存中的日志文件。
我使用 std::strings 的循环缓冲区来存储消息和单独的 const char* 数组,以便在 WinDBG 中更容易阅读,但您可以简单地创建巨大的 char 数组并以纯文本形式存储其中的所有消息。
详细信息:
winxp 上的整个过程:
dt module!variable
,其中“module”是库的名称(不带*.dll),“variable”是变量的名称。您可以使用通配符。您可以在没有module!variable
的情况下使用地址
dt module!variable field
,其中“field”是变量成员。dt -b module!variable field
或dt -b module!variable
此时,您将能够看到存储在内存中的日志内容,此外您还将获得整个系统崩溃时的快照。
另外...
!process
。lm
!thread id
,其中 id 是您在!process
输出中看到的十六进制 ID。Found a solution.
Problem:
Logging is unreliable since messages (when dumped to file) disappear during bsod, packets are sometimes lost when logging over network, and there's slowdown due to logging.
Solution:
Instead of logging to file or over network, configure system to produce full physical memory dump on BSOD and log all messages into any memory buffer. It'll be faster. Once system crashed, it'll dump entire memory into file, and it'll be possible to either view contents of log-file buffer using WinDBG's
dt
(if you have debug symbols) command, or you'll be able to search and locate logfile stored in memory using "memory" view.I used circular buffer of std::strings to store messages and separate array of const char* to make things easier to read in WinDBG, but you could simply create huge array of char and store all messages within it in plaintext.
Details:
Entire process on winxp:
!analyze -v
command.dt module!variable
where "module" is name of your library (without *.dll), and "variable" is name of variable. You can use wildcards. You can use address withoutmodule!variable
dt module!variable field
where "field" is variable member.dt -b module!variable field
ordt -b module!variable
At this point you'll be able to see contents of log that were stored in memory, plus you'll have snapshot of the entire system at the moment when it crashed.
Also...
!process
.lm
!thread id
where id is hexadecimal id you saw in!process
output.看起来崩溃可能是由错误的指针或堆损坏引起的。您可以看出这一点,因为崩溃发生在内存释放函数 (
DxDdDestroySurface
) 中。销毁表面是你绝对需要做的事情 - 你不能只是将其删除,当程序退出时表面仍然会被释放,如果你在内核中禁用它,你将耗尽卡上内存非常快,也以这种方式崩溃。您可以尝试找出导致堆损坏的事件顺序,但这里没有灵丹妙药 - 正如 fileoffset 建议的那样,您需要实际对驱动程序进行逆向工程以了解发生这种情况的原因(这可能有助于比较驱动程序)在有问题的驱动程序版本之前和之后也是如此!)
It looks like the crash may either be caused by a bad pointer, or heap corruption. You can tell this because the crash occurs in a memory-freeing function (
DxDdDestroySurface
). Destroying surfaces is something that you absolutely need to do - you can't just stub this out, the surface will still get freed when the program exits, and if you disable it inside the kernel, you'll run out of on-card memory very quickly and crash that way, as well.You can try to figure out what sequence of events leads up to this heap corruption, but there's no silver bullet here - as fileoffset suggested, you'll need to actually reverse engineer the driver to see why this happens (it may help to compare drivers before and after the offending driver version as well!)