Debugging process hangs in a legacy third-party application
We have a legacy third-party telephony system built on something called "CT ADE" that periodically hangs for a few seconds (5 to 30) then resumes. During these hangs, users experience frustrating pauses in the phone menu. This has been going on for several weeks at least.
This code was not written by me, so my knowledge of it is very limited. Internally there are multiple "tasks" (threads?), one per phone line, that handle calls. When the application hangs, all "tasks" are hung.
The issue does not seem to be load related: it occurs even during periods of low usage. It does not appear to be network related either, since it occurs on systems where the DB is on the same physical box as this app. It does not look disk related, although sample tasks that do lots of DB I/O and file I/O can cause shorter pauses within this application.
The process does not show any memory or CPU spikes when the problem occurs.
At this point I'm just grasping for anything to try...
Comments (3)
Working with legacy code is painful. In my experience you just need to dive in and work out what the code is doing by whatever means works for you, whether that is reading the code or debugging various scenarios and stepping through each line executed.
It will take a while, and there will be parts of the code you never understand, but with enough time reading the code and experimenting with it you should eventually understand enough to figure out what the problem is.
There is a book, Working Effectively with Legacy Code, which I have never read but which is meant to be very good.
Try running a sampling profiler during one of these hangs to see where CPU time is being spent.
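To illustrate what a sampling profiler does, here is a minimal sketch in Python. It is not specific to CT ADE, which is presumably a native application that would need a native profiler. The sketch samples the call stack of every thread at a fixed interval and counts how often each stack is seen. Because it samples on wall-clock time, it also records threads that are blocked and using no CPU. The names `sample_stacks` and `stalled_task` are made up for this example.

```python
import collections
import sys
import threading
import time


def stack_names(frame):
    """Return the function names on a thread's stack, outermost first."""
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return tuple(reversed(names))


def sample_stacks(duration_s=1.0, interval_s=0.01):
    """Sample every other thread's stack at a fixed interval; return hit counts."""
    hits = collections.Counter()
    own_id = threading.get_ident()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == own_id:
                continue  # skip the sampler thread itself
            hits[stack_names(frame)] += 1
        time.sleep(interval_s)
    return hits


def stalled_task(stop_event):
    # Hypothetical stand-in for a "task" that is stuck waiting on something.
    stop_event.wait()


stop = threading.Event()
worker = threading.Thread(target=stalled_task, args=(stop,), daemon=True)
worker.start()
time.sleep(0.05)  # give the worker time to block

hits = sample_stacks(duration_s=0.3)
stop.set()
worker.join()

for stack, count in hits.most_common(3):
    print(f"{count:4d} samples: {' > '.join(stack)}")
```

The stack with the most samples is where the thread spent its wall-clock time, which for a hang is usually the call it is blocked in.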
If the problem is not related to high CPU usage, a profile probably will not gain you anything.
To me it sounds like a multi-threading issue. If possible, attach a debugger and pause the process when the problem shows up, then look at the code currently executing and the call stacks of all threads. It might be that multiple threads try to access a single resource or a thread-safe function and have to wait because another thread has exclusive access to that resource.
This might be something inconspicuous like trying to write to a log.
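As a rough illustration of that failure mode (a Python sketch under assumptions, not the actual CT ADE code), the example below has four hypothetical per-line tasks that all log through one shared lock. When a single writer holds the lock during a slow write, every task stalls for about the same time, with no CPU or memory spike. `log_lock`, `write_log` and `phone_line_task` are invented names.

```python
import threading
import time

log_lock = threading.Lock()  # hypothetical shared resource, e.g. a log file guard
longest_wait = {}            # task name -> longest time spent waiting for the lock


def write_log(task_name, hold_s=0.0):
    """Acquire the shared lock, recording how long this caller waited for it."""
    started = time.monotonic()
    with log_lock:
        waited = time.monotonic() - started
        longest_wait[task_name] = max(longest_wait.get(task_name, 0.0), waited)
        time.sleep(hold_s)  # simulates a slow write while holding the lock


def phone_line_task(task_name, stop_event):
    # Each "task" does cheap work but logs on every iteration.
    while not stop_event.is_set():
        write_log(task_name)
        time.sleep(0.005)


stop = threading.Event()
tasks = [
    threading.Thread(target=phone_line_task, args=(f"line-{i}", stop))
    for i in range(4)
]
for t in tasks:
    t.start()

time.sleep(0.05)
write_log("slow-writer", hold_s=0.3)  # one slow holder stalls every task
time.sleep(0.05)
stop.set()
for t in tasks:
    t.join()

for name in sorted(longest_wait):
    print(f"{name}: longest wait {longest_wait[name] * 1000:.0f} ms")
```

Timing the wait around a suspected lock like this is one cheap way to confirm or rule out a shared resource when a debugger cannot be attached at the moment of the hang.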