How do I determine which task died?
I have an embedded system with multiple (>20) tasks running at different priorities. I also have a watchdog task that checks that none of the other tasks are stuck. My watchdog is working because, once in a blue moon, it reboots the system because a task did not check in.
How do I determine which task died?
I can't just blame the task that has gone longest without kicking the watchdog, because it might have been held off by a higher-priority task that is not yielding.
Any suggestions?
Comments (7)
A per-task watchdog requires that the higher priority tasks yield for an adequate time so that all may kick the watchdog. To determine which task is at fault, you'll have to find the one that's starving the others. You'll need to measure task execution times between watchdog checks to locate the actual culprit.
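One way to measure execution time between watchdog checks is to charge each timer tick to whichever task was running when it fired. The following is only a rough sketch: `NUM_TASKS`, `account_tick`, and `busiest_task_and_reset` are hypothetical names, and it assumes your kernel can tell the tick interrupt which task ID is currently running.

```c
#include <stdint.h>

#define NUM_TASKS 24u  /* assumed task count; the question only says "> 20" */

/* Ticks charged to each task since the watchdog last looked. */
static volatile uint32_t ticks_used[NUM_TASKS];

/* Hypothetical hook: call from the periodic tick interrupt with the ID of
 * the task that was running when the tick fired. */
void account_tick(unsigned running_task)
{
    if (running_task < NUM_TASKS) {
        ticks_used[running_task]++;
    }
}

/* Call from the watchdog at each check. Returns the ID of the task that was
 * charged the most ticks in the window just ended (lowest ID wins a tie),
 * stores that tick count in *ticks_out, and clears all counters. */
unsigned busiest_task_and_reset(uint32_t *ticks_out)
{
    unsigned worst = 0;
    for (unsigned i = 1; i < NUM_TASKS; i++) {
        if (ticks_used[i] > ticks_used[worst]) {
            worst = i;
        }
    }
    if (ticks_out) {
        *ticks_out = ticks_used[worst];
    }
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        ticks_used[i] = 0;
    }
    return worst;
}
```

Tick-granularity sampling is coarse, but a task that starves the others for a whole watchdog period should dominate the counts.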
Is this pre-emptive? I gather so since otherwise a watchdog task would not run if one of the others had gotten stuck.
You make no mention of the OS but, if a watchdog task can check if a single task has not checked in, there must be separate channels of communication between each task and the watchdog.
You'll probably have to modify the watchdog to somehow dump the task number of the one that hasn't checked in and dump the task control blocks and memory so you can do a post-mortem.
Depending on the OS, this could be easy or hard.
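As an illustration of one separate channel per task, each task could own one bit in a shared check-in word, and the watchdog could work out which bits are missing before it dumps anything. This is a sketch under assumptions: the names are made up, it handles at most 31 tasks, and on a real target the bit set must be made atomic (for example by masking interrupts).

```c
#include <stdint.h>

#define NUM_TASKS 24u  /* assumed; must be below 32 for this sketch */

/* Bit n set means task n has checked in during the current period. */
static volatile uint32_t checked_in;

/* Each task calls this with its own ID when it is healthy. */
void task_check_in(unsigned task_id)
{
    if (task_id < NUM_TASKS) {
        /* Read-modify-write: protect with a critical section on target. */
        checked_in |= (uint32_t)1u << task_id;
    }
}

/* The watchdog task calls this once per period. Returns a mask of the tasks
 * that did NOT check in (0 means everyone did) and clears the flags for the
 * next period. */
uint32_t watchdog_collect_missing(void)
{
    const uint32_t all = ((uint32_t)1u << NUM_TASKS) - 1u;
    uint32_t missing = all & ~checked_in;
    checked_in = 0;
    return missing;
}
```

A non-zero return value is the "task number" information to dump along with the task control blocks before rebooting.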
I have also spent the last few weeks working on a watchdog reset problem. Fortunately for me, the ramdump files (in an ARM development environment) include an interrupt handler trace buffer containing the PC and SLR at each interrupt. From the trace buffer I could find out exactly which part of the code was running before the WD reset.
I think if you have the same kind of mechanism for storing the PC and SLR at each interrupt, you can pinpoint the culprit task.
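If your platform does not already provide such a buffer, a small ring buffer filled from the interrupt entry code could play the same role. The sketch below is hypothetical: it simply stores two 32-bit register values per interrupt (labelled `pc` and `lr` here as an assumption), and how you obtain them from the interrupted context depends on your CPU and toolchain.

```c
#include <stdint.h>

#define TRACE_DEPTH 16u  /* number of interrupts remembered; power of two */

typedef struct {
    uint32_t pc;  /* program counter of the interrupted code */
    uint32_t lr;  /* second saved register, e.g. a link register */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t trace_count;  /* total number of entries ever recorded */

/* Call from every interrupt entry with the interrupted context's registers. */
void trace_record(uint32_t pc, uint32_t lr)
{
    trace_entry_t *e = &trace_buf[trace_count % TRACE_DEPTH];
    e->pc = pc;
    e->lr = lr;
    trace_count++;
}

/* Fetch an entry by age: 0 is the most recent interrupt, 1 the one before.
 * Returns 1 and fills *out on success, 0 if no such entry is stored. */
int trace_get(uint32_t age, trace_entry_t *out)
{
    uint32_t avail = trace_count < TRACE_DEPTH ? trace_count : TRACE_DEPTH;
    if (age >= avail) {
        return 0;
    }
    *out = trace_buf[(trace_count - 1u - age) % TRACE_DEPTH];
    return 1;
}
```

After a watchdog reset, walking the entries from age 0 upward and looking the addresses up in the map file shows what was executing just before the reset.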
Depending on your system and OS, there may be different approaches. One very low-level approach I have used is to turn on an LED while each task is running. You may need to put a scope on the LEDs to see very fast task switching.
For an interrupt-driven watchdog, you'd just make the task switcher update the currently running task number each time it is changed, allowing you to identify which one didn't yield.
However, you suggest you wrote the watchdog as a task yourself, so before rebooting, surely the watchdog can identify the starved task? You can store this in memory that persists beyond a warm reboot, or send it over a debug interface. The problem with this is that the starved task is probably not the problematic one: you'll probably want to know the last few task switches (and times) in order to identify the cause.
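A rough sketch of such a record, assuming your kernel offers a context-switch hook and your linker can place a variable in RAM that the startup code does not zero. All names, the log depth, and the magic value are made up for illustration.

```c
#include <stdint.h>

#define SWITCH_LOG_DEPTH 8u           /* how many task switches to remember */
#define LOG_MAGIC        0x57444F47u  /* arbitrary "log is valid" marker */

typedef struct {
    uint32_t magic;         /* LOG_MAGIC once the watchdog has sealed the log */
    uint32_t next;          /* total number of switches recorded */
    struct { uint32_t tick; uint8_t task; } entry[SWITCH_LOG_DEPTH];
    uint8_t  starved_task;  /* task the watchdog found starved */
} switch_log_t;

/* On target, place this in a no-init RAM section (toolchain specific). */
static switch_log_t switch_log;

static void log_reset(void)
{
    switch_log.magic = 0;
    switch_log.next = 0;
    switch_log.starved_task = 0xFF;
    for (unsigned i = 0; i < SWITCH_LOG_DEPTH; i++) {
        switch_log.entry[i].tick = 0;
        switch_log.entry[i].task = 0;
    }
}

/* Call first thing at boot. Returns 1 if a sealed log from the previous run
 * is present (left untouched for reporting), otherwise resets the log and
 * returns 0. */
int switch_log_boot(void)
{
    if (switch_log.magic == LOG_MAGIC) {
        return 1;
    }
    log_reset();
    return 0;
}

/* Call from the kernel's context-switch hook. */
void switch_log_on_switch(uint32_t tick, uint8_t task)
{
    uint32_t i = switch_log.next % SWITCH_LOG_DEPTH;
    switch_log.entry[i].tick = tick;
    switch_log.entry[i].task = task;
    switch_log.next++;
}

/* Call from the watchdog just before it forces the reboot. */
void switch_log_mark_fatal(uint8_t starved_task)
{
    switch_log.starved_task = starved_task;
    switch_log.magic = LOG_MAGIC;
}

/* Read access for the post-mortem report, and a way to discard it after. */
const switch_log_t *switch_log_get(void) { return &switch_log; }
void switch_log_clear(void) { log_reset(); }
```

The magic value is what separates a sealed log from the random contents of RAM after a cold power-up.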
A simplistic, back of the napkin approach would be something like this:
How is your system working exactly? I always use a combination of software and hardware watchdogs. Let me explain...
My example assumes you're working with a preemptive real-time kernel and you have watchdog support in your CPU/microcontroller. This watchdog will perform a reset if it is not kicked within a certain period of time. You want to check two things:
1) The periodic system timer ("RTOS clock") is running (if not, functions like "sleep" would no longer work and your system is unusable).
2) All threads can run within a reasonable period of time.
My RTOS (www.lieron.be/micror2k) provides the possibility to run code in the RTOS clock interrupt handler. This is the only place where you refresh the hardware watchdog, so you're sure the clock is running all the time (if not the watchdog will reset your system).
In the idle thread (always running at lowest priority), a "software watchdog" is refreshed. This is simply setting a variable to a certain value (e.g. 1000). In the RTOS clock interrupt (where you kick the hardware watchdog), you decrement and check this value. If it reaches 0, it means that the idle thread has not run for 1000 clock ticks and you reboot the system (can be done by looping indefinitely inside the interrupt handler to let the hardware watchdog reboot).
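The scheme in the last two paragraphs might look roughly like this in C. It is a sketch, not code from micror2k: the function names are invented, and `kick_hw_watchdog()` in the usage comment stands in for whatever your hardware requires.

```c
#include <stdbool.h>
#include <stdint.h>

#define SOFT_WDT_RELOAD 1000u  /* ticks the idle thread may go without running */

static volatile uint32_t soft_wdt = SOFT_WDT_RELOAD;

/* Called from the idle thread's loop: refresh the software watchdog. */
void idle_thread_refresh(void)
{
    soft_wdt = SOFT_WDT_RELOAD;
}

/* Called once per RTOS clock tick from the interrupt handler. Decrements the
 * counter and returns true while it is still above zero, false once the idle
 * thread has gone SOFT_WDT_RELOAD ticks without refreshing it. */
bool tick_isr_soft_wdt_ok(void)
{
    if (soft_wdt > 0u) {
        soft_wdt--;
    }
    return soft_wdt != 0u;
}

/* Intended use inside the tick interrupt handler (hypothetical names):
 *
 *     if (tick_isr_soft_wdt_ok()) {
 *         kick_hw_watchdog();   // clock and idle thread both alive
 *     } else {
 *         for (;;) { }          // stop kicking; hardware watchdog resets us
 *     }
 */
```

Because the hardware watchdog is only kicked from the tick handler, a stopped clock and a starved idle thread both end in a reset.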
Now for your original question. I assume the system clock keeps running, so it's the software watchdog that resets the system. In the RTOS clock interrupt handler, you can do some "statistics gathering" in case the software watchdog situation occurs. Instead of resetting the system, you can see what thread is running at each clock tick (after the problem occurs) and try to find out what's going on. It's not ideal, but it will help.
Another option is to add several software watchdogs at different priorities. Have the idle thread set VariableA to 1000 and have a (dedicated) medium-priority thread set VariableB. In the RTOS clock interrupt handler, you check both variables. With this information you know whether the looping thread has a priority higher than "medium" or lower than "medium". If you wish, you can add a 3rd or 4th, or however many software watchdogs you like. Worst case, add a software watchdog for each priority that's used (this will cost you as many extra threads, though).
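A minimal sketch of the multi-level idea, assuming three dedicated "canary" threads (idle, medium, high) that each refresh their own counter. The level count, reload value, and function names are illustrative only.

```c
#include <stdint.h>

#define NUM_LEVELS 3u     /* 0 = idle (lowest), 1 = medium, 2 = high */
#define RELOAD     1000u  /* ticks a canary may go without running */

static volatile uint32_t level_wdt[NUM_LEVELS] = { RELOAD, RELOAD, RELOAD };

/* Each canary thread calls this with its own level whenever it gets to run. */
void canary_refresh(unsigned level)
{
    if (level < NUM_LEVELS) {
        level_wdt[level] = RELOAD;
    }
}

/* Call once per tick from the clock interrupt handler. Returns -1 while all
 * canaries are alive. Otherwise returns the highest level whose counter has
 * reached zero: the thread hogging the CPU runs above that canary, and below
 * the next canary up if that one is still alive. */
int tick_check_levels(void)
{
    int highest_expired = -1;
    for (unsigned i = 0; i < NUM_LEVELS; i++) {
        if (level_wdt[i] > 0u) {
            level_wdt[i]--;
        }
        if (level_wdt[i] == 0u) {
            highest_expired = (int)i;
        }
    }
    return highest_expired;
}
```

For example, a return value of 1 while level 2 keeps refreshing places the looping thread somewhere between the medium and high canaries.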