Tips for debugging hard-to-reproduce concurrency bugs?
What are some tips for debugging hard-to-reproduce concurrency bugs that only happen, say, once every thousand runs of a test? I have one of these and I have no idea how to go about debugging it. I can't put print statements or debugger watches all over the place to observe internal state, because that would change the timing and produce overwhelming amounts of information on all the runs where the bug does not reproduce.
Comments (8)
Here is my technique: I generally use a lot of assert() calls to check data consistency and validity as often as possible. When an assert fails, the program crashes and generates a core file. I then load the core file in a debugger to understand what thread configuration led to the data corruption.
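A minimal sketch of the invariant-checking idea, written in Python purely for illustration. The BoundedBuffer class, its count field, and the hammer helper are hypothetical names that do not come from the answer. In C or C++ a failed assert() aborts the process and can leave a core file; in this sketch the failure is an AssertionError whose message records the thread that saw the bad state.

```python
import threading


class BoundedBuffer:
    """Hypothetical shared structure; only the invariant checks matter here."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.count = 0  # redundant bookkeeping that must always equal len(items)
        self.lock = threading.Lock()

    def _check_invariants(self, where):
        # Record which thread saw the bad state; note that Python strips
        # assert statements when run with -O.
        who = threading.current_thread().name
        assert 0 <= self.count <= self.capacity, (
            f"{where}: count={self.count} out of range (thread {who})")
        assert self.count == len(self.items), (
            f"{where}: count={self.count} but len={len(self.items)} (thread {who})")

    def put(self, item):
        with self.lock:
            self._check_invariants("put:entry")
            if self.count == self.capacity:
                return False
            self.items.append(item)
            self.count += 1
            self._check_invariants("put:exit")
            return True

    def get(self):
        with self.lock:
            self._check_invariants("get:entry")
            if self.count == 0:
                return None
            item = self.items.pop(0)
            self.count -= 1
            self._check_invariants("get:exit")
            return item


def hammer(buffer, num_threads=4, ops_per_thread=500):
    """Drive the buffer from several threads and collect assertion failures."""
    errors = []

    def worker(seed):
        try:
            for i in range(ops_per_thread):
                if (i + seed) % 2 == 0:
                    buffer.put(i)
                else:
                    buffer.get()
        except AssertionError as exc:
            errors.append(str(exc))

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Checking on both entry and exit of every operation narrows the window in which corruption can go unnoticed, so the failure is reported close to the code that caused it.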
This might not help you, but it will probably help someone who sees this question in the future.
If you're using a .NET language, you can use the CHESS project from Microsoft Research. It runs unit tests while systematically exploring thread interleavings and shows you which ones cause the bug to happen.
There may be a similar tool for the language you're using.
It depends heavily on the nature of the problem. Two things are commonly useful: bisection (to narrow down the search space) and code "instrumentation" with assertions on accessing thread IDs, lock/unlock counts, locking order, etc., in the hope that the next time the problem reproduces, the application will either log a verbose message or dump core, pointing you to the solution.
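One way the lock-order and lock-count instrumentation could look, as a hedged Python sketch. RankedLock, its rank parameter, and the per-thread _held stack are hypothetical names invented for this example; the idea is simply that every lock gets a fixed rank and a thread may only take locks in increasing rank order.

```python
import threading

# Per-thread stack of the RankedLock objects the current thread holds.
_held = threading.local()


class RankedLock:
    """Hypothetical lock wrapper that asserts a global acquisition order."""

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()
        self.acquire_count = 0
        self.release_count = 0

    def __enter__(self):
        stack = getattr(_held, "stack", None)
        if stack is None:
            stack = _held.stack = []
        if stack:
            top = stack[-1]
            # Fail before blocking, so the report names the offending thread
            # instead of the program silently deadlocking later.
            assert top.rank < self.rank, (
                f"lock order violation in {threading.current_thread().name}: "
                f"holding {top.name} (rank {top.rank}), "
                f"acquiring {self.name} (rank {self.rank})")
        self._lock.acquire()
        self.acquire_count += 1  # updated while holding the lock
        stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        stack = _held.stack
        assert stack and stack[-1] is self, (
            f"{self.name} released out of order")
        stack.pop()
        self.release_count += 1  # still holding the lock here
        self._lock.release()
        return False
```

With two locks a = RankedLock("a", 1) and b = RankedLock("b", 2), taking a then b passes, while taking b then a raises an AssertionError naming the thread and both locks, even on runs where the opposite-order thread never shows up to complete the deadlock.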
One method for finding data corruption caused by a concurrency bug:
Targeted unit test code is time-consuming but effective, in my experience.
Narrow down the failing code as much as you can. Write test code that's specific to the apparent culprit code and run it in a debugger for as long as it takes to reproduce the problem.
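A rough sketch of such a targeted test harness, assuming Python. The names run_until_failure, make_counter, and locked_increment are hypothetical; the counter stands in for whatever code is under suspicion. The harness rebuilds a small state, releases all threads at once, and repeats until a check on the final state fails.

```python
import threading


def run_until_failure(make_state, operation, check,
                      num_threads=8, ops_per_thread=1000, max_attempts=100):
    """Repeat a small concurrent scenario until `check` fails.

    Returns the 1-based attempt number of the first failure, or None if
    every attempt passed.
    """
    for attempt in range(1, max_attempts + 1):
        state = make_state()
        start_line = threading.Barrier(num_threads)

        def worker():
            start_line.wait()  # release all threads together to maximise overlap
            for _ in range(ops_per_thread):
                operation(state)

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if not check(state):
            return attempt  # a convenient place for a debugger breakpoint
    return None


# Hypothetical code under suspicion: a counter meant to be lock-protected.
def make_counter():
    return {"n": 0, "lock": threading.Lock()}


def locked_increment(state):
    with state["lock"]:
        state["n"] += 1
```

A call such as run_until_failure(make_counter, locked_increment, lambda s: s["n"] == 8 * 1000) keeps retrying the scenario; running it under a debugger with a breakpoint on the failure path stops the program at the first attempt where the check fails.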
One of the strategies I use is to simulate interleaving of the threads by introducing spin waits. The caveat is that you should not use the standard spin-wait mechanisms for your platform, because they will likely introduce memory barriers. If the issue you are trying to troubleshoot is caused by a missing memory barrier (it is difficult to get the barriers correct when using lock-free strategies), then the standard spin-wait mechanisms will just mask the problem. Instead, place an empty loop at the points where you want your code to stall for a moment. This can increase the probability of reproducing a concurrency bug, but it is not a magic bullet.
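To show where such a stall goes, here is a small sketch, assuming Python. The names stall, STALL_ITERATIONS, Counter, and run_trial are hypothetical. Python is used only to illustrate widening the window between a read and the write that depends on it; the memory-barrier caveat matters mainly for native, lock-free code and is not demonstrated here.

```python
import threading

STALL_ITERATIONS = 200_000  # hypothetical tuning knob; adjust per machine


def stall(iterations=STALL_ITERATIONS):
    """Burn time with a plain empty loop: no sleep, no lock, no event."""
    for _ in range(iterations):
        pass


class Counter:
    """Deliberately racy counter used as the code under suspicion."""

    def __init__(self):
        self.value = 0

    def unsafe_increment(self, inject_stall):
        current = self.value          # read
        if inject_stall:
            stall()                   # widen the gap between read and write
        self.value = current + 1      # write, possibly based on a stale read


def run_trial(num_threads=4, increments=5, inject_stall=True):
    """Return (observed, expected) counter values after one concurrent run."""
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.unsafe_increment(inject_stall)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, num_threads * increments
```

If observed comes back lower than expected, updates were lost in the widened window. Comparing runs with inject_stall=True and inject_stall=False gives a rough sense of how much the stall raises the reproduction rate on a given machine.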
If the bug is a deadlock, simply attaching a debugging tool (like gdb or strace) to the program after the deadlock happens, and observing where each thread is stuck, can often get you enough information to track down the source of the error quickly.
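gdb and strace are external tools that attach to a running process. As an in-process analog of "observing where each thread is stuck", the following Python sketch dumps the current stack of every live thread. The helpers dump_thread_stacks and make_deadlock are hypothetical names, and the deadlock is constructed on purpose so there is something to inspect.

```python
import sys
import threading
import time
import traceback


def dump_thread_stacks():
    """Return {thread name: formatted stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    report = {}
    # sys._current_frames() is a CPython facility intended for debugging
    # situations like this one; it maps thread ids to their current frames.
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        report[name] = "".join(traceback.format_stack(frame))
    return report


def make_deadlock(settle_seconds=0.5):
    """Start two daemon threads that take two locks in opposite order."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)

    def take_a_then_b():
        with lock_a:
            both_hold_first.wait()  # make sure the other thread holds lock_b
            with lock_b:            # blocks forever
                pass

    def take_b_then_a():
        with lock_b:
            both_hold_first.wait()  # make sure the other thread holds lock_a
            with lock_a:            # blocks forever
                pass

    threads = [
        threading.Thread(target=take_a_then_b, name="a-then-b", daemon=True),
        threading.Thread(target=take_b_then_a, name="b-then-a", daemon=True),
    ]
    for t in threads:
        t.start()
    time.sleep(settle_seconds)  # give both threads time to reach the blocked state
    return threads
```

Printing each entry of dump_thread_stacks() after make_deadlock() shows one thread stopped inside take_a_then_b and the other inside take_b_then_a, which is the same kind of information a per-thread backtrace from an attached debugger gives you.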
Here is a little chart I've made with some debugging techniques to keep in mind when debugging multithreaded code. The chart is growing, so please leave comments and tips to be added. http://adec.altervista.org/blog/multithreading-debugging-chart/