.NET Windows Service crashes when dispatching a Windows message
I have a major problem with a .NET Windows Service. It runs on multiple servers with very different configurations. The service seems to be susceptible to crashes on some of the servers, but stable on others. The instability was introduced recently, but so far the conditions that trigger it are unknown. We have servers running Windows 2003 / Windows 2003 R2 / Windows 2008. Most of them are fully updated.
We tried building the service against different target framework versions (2.0 / 3.5 / 4.0), but it makes no difference. Machines on which the service is unstable are unstable with every version of the framework. I've tried repairing the .NET Framework installations, but that doesn't make a difference either. As far as I can tell, the entire service and its dependencies are managed code.
I've also tried running the server code in a command-line version. This seems to run stably, and we use it as a work-around for now. However, the problem is not related to the user account. The service normally runs as "Local Service". I've tried running it under the local Administrator account, which is the account that I use to run the command-line version, but the service is still unstable.
So far, I've been able to create a reproducible situation on one of the servers:
- Start the service on the server.
- Log on as a domain-user in a new RDP session on the same server.
- Start our client software in that session; it accesses our service over TCP remoting.
- Close the client and the session.
- Open a new RDP session with the domain-user on the server.
- Instant crash of service!
Note that the service crashes at the moment the domain-user logs onto the new RDP session. Our client-software has not been run in that session at that point. If I don't open the client and access the service with TCP remoting in the first session, the service won't crash during the second logon. If I open the sessions as local Administrator, the service does not crash either.
I've been able to attach a native debugger (OllyDbg) to the crashing service. It crashes with an Access Violation when trying to execute at address 0x4bc4cee9. That address is the same on all servers and configurations (I've seen that address every time in the event log). I have looked at the stack of the crashing thread. The thread seems to be created just before the crash. First it tries to load Ole32.dll. It runs some code from Ole32 and then I see these functions being called:
- User32.SetTimer
- User32.GetMessageW
- User32.TranslateMessage
- User32.DispatchMessageW
The crash is somewhere in DispatchMessageW. I can see the *MSG argument for DispatchMessageW on the stack. It looks like this is passed:
- hWnd = 0x00090082
- Message = 0x0000001e
- wParam = 0x00000000
- lParam = 0x00000000
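The Message value in this dump can be looked up against the window-message constants in the Windows SDK header winuser.h, where 0x001E is WM_TIMECHANGE. The following is a minimal sketch of such a lookup; the table is only a small excerpt of the header, and the helper name `describe_message` is a hypothetical one chosen for illustration.

```python
# A few window-message identifiers as defined in the Windows SDK header winuser.h.
# This is a small excerpt, not the full list.
WINDOW_MESSAGES = {
    0x0015: "WM_SYSCOLORCHANGE",
    0x001A: "WM_SETTINGCHANGE",
    0x001B: "WM_DEVMODECHANGE",
    0x001D: "WM_FONTCHANGE",
    0x001E: "WM_TIMECHANGE",
    0x0113: "WM_TIMER",
}


def describe_message(message_id: int) -> str:
    """Return the message id with its symbolic name, or 'unknown' if not in the excerpt."""
    name = WINDOW_MESSAGES.get(message_id, "unknown")
    return f"0x{message_id:04X} ({name})"


# The Message field from the MSG structure seen on the stack.
print(describe_message(0x0000001E))
```

WM_TIMECHANGE is normally sent to top-level windows as a broadcast, which would fit a message that arrives without the service asking for it.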
I've tried Spy++. But it doesn't seem to detect any hWnd's in the Windows service.
So, the service receives this message, tries to parse and dispatch it and every time ends up calling 0x4bc4cee9, which is unmapped memory, and crashes.
EDIT: As per Hans' suggestion I investigated the systemevents. I debugged the service. I added an extra service to my service-executable, so that I could start the helper-service, then attach a debugger, and then start the real service. This way I am able to debug even the OnStart of the service. I placed breakpoints on SetWindowsHookA, SetWindowsHookW, SetWindowsHookExA and SetWindowsHookExW, but none of them was hit!?
EDIT 2: I checked all my notes and found that I had jumped to the wrong conclusions because of a typo in my notes :-S Anyway, the address of the crash is 0x4bc4cee9. At some point during execution, msado15.dll is loaded there. I can see in the debugger that when the client disconnects from the server, there are 2 managed exceptions. Shortly after that I see a WM_TIMER message, which is handled by the dispatcher and calls CoFreeUnusedLibraries(). That results in msado15.dll being unloaded.
I opened msado15.dll in a disassembler and loaded the symbols from Microsoft. The DLL is part of Microsoft Data Access Components (MDAC) 2.8 SP1. The version is 2.82.4795.0, which indicates it is the latest version, released in January 2011. There are Advise() and Unadvise() functions for ADOConnection and ADORecordset. Advise() calls InitAsyncEvents(), which calls RegisterClassEx(). The WndProc passed to RegisterClassEx() is FireEventOnMainThread(), which is at 0x4bc4cee9! I can see the function there!
What should happen is that when the objects are disposed, Unadvise(), DestroyAsyncEvents() and UnregisterClass() are called. But somehow that is not happening. The DLL gets unloaded before it can unregister the classes, which results in a crash on the next event. This may somehow relate to the 2 managed exceptions. I will investigate further.
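The failure described in this edit is a dangling WndProc: a window class stays registered while the module that contains its WndProc is unloaded. The following toy model in Python sketches that sequence. It does not call any Windows API; the module base address, module size and the class name `AsyncEvents` are hypothetical values picked for illustration, and only the WndProc address 0x4bc4cee9 comes from the post.

```python
class AccessViolation(Exception):
    """Stands in for the native access violation seen in the debugger."""


class ToyProcess:
    """Toy model of a process: loaded modules plus registered window classes."""

    def __init__(self):
        self.modules = {}         # module name -> (base address, size)
        self.window_classes = {}  # window class name -> WndProc address

    def load(self, name, base, size):
        self.modules[name] = (base, size)

    def unload(self, name):
        # Models the bug: the module is unmapped, but its window class
        # is left registered and still points into the old address range.
        del self.modules[name]

    def register_class(self, class_name, wndproc):
        self.window_classes[class_name] = wndproc

    def is_mapped(self, address):
        return any(base <= address < base + size
                   for base, size in self.modules.values())

    def dispatch(self, class_name, message):
        wndproc = self.window_classes[class_name]
        if not self.is_mapped(wndproc):
            raise AccessViolation(f"execute at unmapped address 0x{wndproc:08x}")
        return f"WndProc at 0x{wndproc:08x} handled message 0x{message:04x}"


WNDPROC = 0x4BC4CEE9  # address reported in the post

process = ToyProcess()
process.load("msado15.dll", base=0x4BC00000, size=0x100000)  # hypothetical base and size
process.register_class("AsyncEvents", WNDPROC)               # hypothetical class name

print(process.dispatch("AsyncEvents", 0x001E))  # module still loaded: handled normally

process.unload("msado15.dll")                   # unloaded without unregistering the class
try:
    process.dispatch("AsyncEvents", 0x001E)
except AccessViolation as error:
    print("crash:", error)
```

In the model, as in the post, nothing goes wrong at unload time; the crash only shows up when the next message reaches the stale window class.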
Stacktrace: http://pastebin.com/dsSjMe4Y
Log: http://pastebin.com/qD2MXvHd
I would really appreciate some guidance in this matter. For example, which process could be sending this message? How is it possible that the service dispatches it so completely wrong? And how can I avoid this?
Thank you,
Heathcliff
Comments (1)
I found the problem. It took me almost 8 days to pin it down and create a work-around!
All ADODB versions up to and including 6.0 have a serious bug! ADODB 2.8 is part of MDAC 2.8 (for XP and Win2003), ADODB 6.0 is part of Vista/Win2008 and ADODB 6.1 is part of Win7/Win2008R2. The core DLL is msado15.dll. When a Connection or Recordset class is instantiated, it registers a window class with RegisterClass(), with a WndProc called __FireEventOnMainThread(). After all COM objects are disposed again, the reference count drops to 0. When Ole32!CoFreeUnusedLibraries() is called, it calls DllCanUnloadNow() on all COM DLLs. DllCanUnloadNow() checks the reference count and, when it is 0, returns 0 (S_OK), indicating that the DLL can be unloaded.
In ADODB 6.1 (only released for Win7 and Win2008R2) Microsoft implemented a fix in DllCanUnloadNow(): it checks for the AsyncEventsWnd and, if it still exists, the DLL is not unloaded. But the real bug is still there in the COM object disposal. The reference count is decreased, but for some reason UnregisterClass() is not called. When the DLL is unloaded and a broadcast message is then sent, the application runs into an Access Violation, because the WndProc is no longer in memory. Crash!
In the case of the service, an Ole32!CDllHost is instantiated (not sure where). This class starts a timer with the TimerProc STAHostTimerProc(), firing every 300 seconds. STAHostTimerProc() calls CoFreeUnusedLibraries(). There are many different broadcast messages. For example, when a new user session is started on a terminal server, WM_TIMECHANGE is broadcast.
So, on machines with Windows up to Vista/Win2008, an application will crash if it creates an ADODB.Connection or ADODB.Recordset, an Ole32!CDllHost exists, it disposes all its COM objects, the timer then unloads msado15.dll, and a broadcast message arrives afterwards.
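The difference between the older DllCanUnloadNow() behaviour and the 6.1 fix, as described above, can be sketched as a small decision function. This is a toy model of the logic in the answer, not the actual msado15.dll code; the function and parameter names are hypothetical. S_OK (0) and S_FALSE (1) are the standard HRESULT values that DllCanUnloadNow() returns.

```python
S_OK = 0     # HRESULT: the DLL may be unloaded
S_FALSE = 1  # HRESULT: the DLL must stay loaded


def dll_can_unload_now(ref_count, async_events_window_exists, has_6_1_fix):
    """Toy model of the unload decision described in the answer."""
    if ref_count > 0:
        return S_FALSE  # COM objects are still alive
    if has_6_1_fix and async_events_window_exists:
        return S_FALSE  # 6.1 fix: the event window still needs its WndProc
    return S_OK         # older versions: unload although the window class remains


# All COM objects released, but the async-events window was never cleaned up.
for has_fix in (False, True):
    result = dll_can_unload_now(ref_count=0,
                                async_events_window_exists=True,
                                has_6_1_fix=has_fix)
    verdict = "unload (later crash)" if result == S_OK else "stay loaded"
    print(f"6.1 fix present: {has_fix} -> {verdict}")
```

Note that in this model the 6.1 fix only prevents the unload; the missing UnregisterClass() call that the answer identifies as the real bug is not addressed by it.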
It's terrible that Microsoft fixed this in ADODB 6.1 but did not release a fix for earlier versions. All older operating systems are affected.
As a work-around, we will keep the reference count of the ADO COM objects from ever reaching 0 by holding a static ADODB.Connection object.
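Why a single long-lived reference is enough can be sketched with a toy reference-count model. It is written in Python purely for illustration and does not use the ADO API; the class `ToyAdoLibrary` and the function `run_scenario` are hypothetical names, and the unload rule is the pre-6.1 behaviour as described in this answer.

```python
class ToyAdoLibrary:
    """Toy stand-in for msado15.dll with a DLL-wide reference count."""

    def __init__(self):
        self.ref_count = 0
        self.loaded = True

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1

    def on_free_unused_libraries(self):
        # Pre-6.1 behaviour as described above: unload as soon as no
        # COM object references remain.
        if self.ref_count == 0:
            self.loaded = False


def run_scenario(pin_static_connection):
    """Return True if the library is still loaded after the timer fires."""
    library = ToyAdoLibrary()
    if pin_static_connection:
        library.add_ref()   # stands in for the static ADODB.Connection, never released
    library.add_ref()       # a per-request connection is created ...
    library.release()       # ... and disposed when the client disconnects
    library.on_free_unused_libraries()  # the 300-second timer fires
    return library.loaded   # False means the next broadcast message would crash


print("without static connection, still loaded:", run_scenario(False))
print("with static connection, still loaded:   ", run_scenario(True))
```

The trade-off in this approach is that the DLL stays loaded for the whole lifetime of the process, which is exactly what the work-around is after.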