Debugging a strange bug that depends on the chosen scheduler
I am seeing strange behavior in software I am working on. It is a realtime machine controller, written in C++, running on Linux, and it makes extensive use of multithreading.
When I run the program without asking it to be realtime, everything works as I expect. But when I ask it to switch to realtime mode, there is a clearly reproducible bug that crashes the application. I guess it must be some kind of deadlock, because a mutex acquisition runs into a timeout and ultimately triggers an assertion.
My question is how to hunt this one down. Looking at the backtrace from the resulting core dump is not very helpful, because the cause of the problem lies somewhere in the past.
The following code switches between 'normal' and 'realtime' behaviour:
In main.cpp (simplified, return-codes are checked via assertions):
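if (startAsRealtime) {
    struct sched_param sp;
    memset(&sp, 0, sizeof(sched_param));
    sp.sched_priority = 99;  // highest SCHED_RR priority on Linux
    // Switch the calling process to round-robin realtime scheduling.
    sched_setscheduler(getpid(), SCHED_RR, &sp);
}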
In every thread (simplified, return-codes are checked via assertions):
if (startAsRealtime) {
    sched_param param;
    // Use the attributes set below instead of inheriting from the creating thread.
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_getschedparam(&attr, &param);
    param.sched_priority = priority;
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &param);
}
Thanks in advance
Comments (2)
If you're using glibc as your C library, you could use the answer to the question "Is it possible to list mutexes which a thread holds" to find the thread that is holding the mutex that times out. That should start to narrow things down: you can then inspect that thread and find out why it is not giving up the mutex.
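As a rough illustration of that idea: on glibc, a locked pthread_mutex_t records the kernel thread id of its owner in an internal field, which you can print in gdb (print mutex.__data.__owner) and match against the LWP numbers from "info threads". The sketch below reads the same field from code. It assumes glibc on Linux with default (non-elided) mutexes; __data.__owner is an implementation detail, not a public API, and the helper names current_tid and mutex_owner_tid are made up for this example.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Kernel thread id of the calling thread (the number gdb shows as "LWP").
inline pid_t current_tid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// ASSUMPTION: glibc-internal layout. __data.__owner holds the TID of the
// thread that currently owns the mutex, or 0 when it is unlocked.
// Reading it without synchronization is only acceptable as a debugging aid.
inline pid_t mutex_owner_tid(const pthread_mutex_t& m) {
    return static_cast<pid_t>(m.__data.__owner);
}
```

Calling mutex_owner_tid on the mutex right before the timeout assertion fires would give you a TID to look up in the core dump.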
One of your realtime threads might be spinning in a loop (not yielding), thus starving other threads and resulting in a mutex timeout.
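To make the spinning case concrete: under SCHED_RR a runnable thread is not preempted by lower-priority threads, so a loop like "while (!ready) {}" can keep the thread that would set the flag (and release the mutex) from ever running on that CPU. A minimal sketch of a blocking alternative follows; ReadyFlag, wait_ready and set_ready are hypothetical names, not taken from your code.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical shared state that one thread waits on and another sets.
struct ReadyFlag {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

// Blocks (sleeps) until the flag is set or the timeout expires, so the
// scheduler can run other threads in the meantime.
// Returns true if the flag was set, false on timeout.
inline bool wait_ready(ReadyFlag& f, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(f.m);
    return f.cv.wait_for(lk, timeout, [&f] { return f.ready; });
}

// Sets the flag and wakes all waiters.
inline void set_ready(ReadyFlag& f) {
    {
        std::lock_guard<std::mutex> lk(f.m);
        f.ready = true;
    }
    f.cv.notify_all();
}
```

If a loop really has to poll, blocking for a short interval inside it is usually enough to let lower-priority threads make progress.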
There could also be a race condition that only manifests itself when you switch to "realtime mode": the timing of events in realtime mode may happen to trigger some kind of deadlock.
If you have places where you acquire multiple levels of locks, or lock recursively, those should be the first places you suspect.
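For the multi-lock case, the classic failure is two threads taking the same two mutexes in opposite order, which can stay hidden until the scheduling changes. One possible way to rule that out, sketched here with a made-up Axis type and swap_positions function, is to take both locks through std::lock, which acquires them without deadlocking regardless of argument order.

```cpp
#include <mutex>
#include <utility>

// Hypothetical piece of shared state guarded by its own mutex.
struct Axis {
    std::mutex m;
    int position = 0;
};

// Locks both mutexes together via std::lock, so two threads calling
// swap_positions(a, b) and swap_positions(b, a) cannot deadlock each other.
inline void swap_positions(Axis& a, Axis& b) {
    if (&a == &b) return;  // locking the same mutex twice would be an error
    std::lock(a.m, b.m);
    std::lock_guard<std::mutex> la(a.m, std::adopt_lock);
    std::lock_guard<std::mutex> lb(b.m, std::adopt_lock);
    std::swap(a.position, b.position);
}
```

A fixed, documented global lock order achieves the same thing and may be preferable in a realtime path, since std::lock may retry internally.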
If you really have no clue where the problem is, try the binary search approach for bracketing the problem. Recursively cut out half of the functionality until you narrow it down to the actual problem. You might have to mock some subsystems that are temporarily cut out.
You can apply this binary search technique to your mutex acquisition timeouts to find which one is the culprit.
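One way to narrow down the mutex timeouts is to give every timed lock a name and record the call site of the last successful acquisition, so that a timeout tells you which mutex it was and who took it last. This is only a debugging sketch under my own assumptions: TracedMutex and its members are invented for illustration, and the extra bookkeeping mutex adds overhead you would not want in production.

```cpp
#include <chrono>
#include <mutex>
#include <string>
#include <utility>

// Debug wrapper around std::timed_mutex that remembers where it was last locked.
class TracedMutex {
public:
    explicit TracedMutex(std::string name) : name_(std::move(name)) {}

    // Tries to lock within `timeout`. Returns true on success. On timeout,
    // returns false and writes a description into `report`.
    bool lock_for(std::chrono::milliseconds timeout, const char* site,
                  std::string& report) {
        if (!m_.try_lock_for(timeout)) {
            std::lock_guard<std::mutex> g(meta_);
            report = name_ + " timed out at " + site +
                     "; last locked at " + holder_site_;
            return false;
        }
        std::lock_guard<std::mutex> g(meta_);
        holder_site_ = site;
        return true;
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> g(meta_);
            holder_site_ = "(unlocked)";
        }
        m_.unlock();
    }

private:
    std::timed_mutex m_;   // the lock being traced
    std::mutex meta_;      // protects holder_site_
    std::string name_;
    std::string holder_site_ = "(unlocked)";
};
```

Swapping this in for half of the mutexes at a time, and logging the report where the assertion currently fires, is one way to apply the bisection.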