Embedded systems: last gasp before reboot
When things go badly awry in embedded systems, I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize the risk (by not allocating any memory during the final write, and by boosting the write process's priority).
But that relies on someone retrieving the log file. Now I am considering sending a message over the intertubes to report the error before rebooting.
On second thought, of course, it would be better to send that message after reboot, but it did get me thinking...
What sort of things ought I to be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system that is in an unstable state?
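To make the "no allocation during the final write" idea concrete, here is a minimal sketch, assuming the message is built in a buffer reserved at link time and formatted by hand rather than through a library routine that might touch the heap. The names `g_crash_msg`, `crash_format` and the message layout are hypothetical, and the actual flash write is left out because it is platform-specific.

```c
#include <stddef.h>
#include <stdint.h>

#define CRASH_MSG_MAX 48u

/* Reserved statically, so nothing is allocated while the system is failing. */
static char g_crash_msg[CRASH_MSG_MAX];

/* Append a string, always leaving room for the terminating NUL. */
static size_t append_str(char *buf, size_t pos, size_t cap, const char *s)
{
    while (*s != '\0' && pos + 1u < cap)
        buf[pos++] = *s++;
    return pos;
}

/* Append an unsigned decimal number without printf or the heap. */
static size_t append_u32(char *buf, size_t pos, size_t cap, uint32_t v)
{
    char tmp[10];               /* 4294967295 has 10 digits */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + (v % 10u));
        v /= 10u;
    } while (v != 0u);

    while (n > 0u && pos + 1u < cap)
        buf[pos++] = tmp[--n];
    return pos;
}

/* Build "ERR <code> line <line>" in the static buffer and return it. */
const char *crash_format(uint32_t code, uint32_t line)
{
    size_t pos = 0;

    pos = append_str(g_crash_msg, pos, CRASH_MSG_MAX, "ERR ");
    pos = append_u32(g_crash_msg, pos, CRASH_MSG_MAX, code);
    pos = append_str(g_crash_msg, pos, CRASH_MSG_MAX, " line ");
    pos = append_u32(g_crash_msg, pos, CRASH_MSG_MAX, line);
    g_crash_msg[pos] = '\0';
    return g_crash_msg;
}
```

The returned pointer could then be handed to whatever routine writes the log, for example `crash_format(12u, 345u)` for error 12 raised at line 345.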
Comments (7)
I think the best-known example of proper exception handling is a missile self-destructing. The exception was caused by an arithmetic overflow in software. There was obviously a lot of tracing/recording media involved, because the root cause is known: it was discovered and debugged.
So every embedded design must include two features: recording media, like your log file, and a graceful halt, such as disabling all timers/interrupts, shutting all ports and sitting in an infinite loop or, in the case of a missile, self-destruction.
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and the register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable: if the problem is transient, the restart will resolve it, and you want to keep things simple and let the device continue. If the problem is not transient, you are not going to make forward progress anyway, and someone can come along and connect a diagnostic device.
A very interesting paper on failures and recovery: Why Do Computers Stop and What Can Be Done About It?
For a very simple system, do you have a pin you can wiggle? For example, configure it as a high output when you start up and, if things go way south (i.e. a watchdog reset is pending), set it low.
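A rough sketch of the pin idea, assuming a memory-mapped GPIO block with a direction register and an output register. The register layout, `gpio_regs_t` and `HEALTH_PIN_MASK` are made up for illustration; the real names and addresses come from the part's reference manual.

```c
#include <stdint.h>

/* Hypothetical GPIO block: one direction register, one output register. */
typedef struct {
    volatile uint32_t dir;   /* 1 = pin is an output */
    volatile uint32_t out;   /* 1 = pin is driven high */
} gpio_regs_t;

/* Hypothetical choice of pin 5 as the "I am healthy" signal. */
#define HEALTH_PIN_MASK (1u << 5)

/* Call once at start-up: make the pin an output and drive it high. */
void health_pin_init(gpio_regs_t *gpio)
{
    gpio->out |= HEALTH_PIN_MASK;   /* set the level first to avoid a low glitch */
    gpio->dir |= HEALTH_PIN_MASK;
}

/* Call from the failure path: a single register write, nothing that can block. */
void health_pin_fail(gpio_regs_t *gpio)
{
    gpio->out &= ~HEALTH_PIN_MASK;
}
```

On real hardware the `gpio` pointer would be a fixed peripheral address; passing it in keeps the sketch buildable on a desktop compiler as well.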
Have you ever considered using a garbage collector?
And I'm not joking.
If you do dynamic allocation at runtime in embedded systems, why not reserve a mark buffer and do a mark and sweep when the excrement hits the rotating air blower?
You've probably got the malloc (or whatever) implementation's source, right?
If you don't have library sources for your embedded system, forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead... who cares how long it takes? It obviously isn't critical that it be running this instant; if it were, you couldn't risk "dying" like this anyway.
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code, it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.
How to reserve a section of RAM that is non-initialised is platform-dependent, and depends on whether or not you're running a full-blown OS (Linux) that manages RAM initialisation. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual ones, e.g. .bss) which is not initialised by the C start-up code. If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to check its validity. If your processor has a way to tell you whether you're in a reboot or a power-up reset, then you should also use that to decide that the data is invalid after a power-up.
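Here is a minimal sketch of that approach, under several assumptions: a GCC-style section attribute, a section called `.noinit` that the linker script and start-up code leave untouched, and a `power_on_reset` flag that the caller derives from the chip's reset-cause register. `TARGET_HAS_NOINIT_SECTION`, `crash_record_t` and the function names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Only use the section attribute on targets whose linker script provides it. */
#ifdef TARGET_HAS_NOINIT_SECTION
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

typedef struct {
    uint32_t reason;   /* application-defined error code */
    uint32_t pc;       /* program counter at the time of failure */
    uint32_t crc;      /* CRC-32 over the fields above */
} crash_record_t;

static NOINIT crash_record_t g_crash;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but table-free. */
static uint32_t crash_crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Called on the failure path, just before the reboot. */
void crash_record_save(uint32_t reason, uint32_t pc)
{
    g_crash.reason = reason;
    g_crash.pc = pc;
    g_crash.crc = crash_crc32(&g_crash, offsetof(crash_record_t, crc));
}

/* Called early after boot. Returns 1 and fills *out if a valid record
 * survived the reset; the record is then invalidated so it is reported once. */
int crash_record_take(int power_on_reset, crash_record_t *out)
{
    uint32_t expect = crash_crc32(&g_crash, offsetof(crash_record_t, crc));
    int valid = !power_on_reset && g_crash.crc == expect;

    if (valid)
        *out = g_crash;
    g_crash.crc = ~expect;   /* guaranteed not to match the payload any more */
    return valid;
}
```

At start-up the application would call `crash_record_take()` once and, if it returns 1, log the record or send it over the comms channel.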
There is no single answer to this. I would start with a watchdog timer. This reboots the system if things go terribly awry.
Something else to consider: what is not in a log file is also important. If you have routine updates from various tasks/actions logged, then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
The one thing you want to make sure of is that you do not corrupt data that might legitimately be in flash. If you try to write information in a crash situation, you need to do so carefully, knowing that the system might be in a very bad state, so anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out of a serial port. A UART driver that's accessible from a crashed state is usually pretty simple: it just needs to be a polling driver that writes characters to the transmit data register when the busy bit is clear. A crash handler generally doesn't need to play nice with multitasking, so polling is fine. It also generally doesn't need to worry about incoming data, or at least not about incoming data in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect multitasking and interrupt handling to be working, since the system is screwed up.
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer is usually responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything; dump the most important stuff first. Do not have the crash handler kick the watchdog: you don't want some bug to mistakenly prevent the watchdog from resetting the system.
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash; just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so the dump can be detected and sent somewhere useful. Otherwise there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
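For illustration, a polled crash-dump routine along those lines might look like the sketch below. The UART register layout (`uart_regs_t`, `UART_STATUS_TX_BUSY`) is invented for the example, and how the register values and stack words get captured is left to the platform's trap handler. Note that it never touches the watchdog.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical UART block: a status register and a transmit data register. */
typedef struct {
    volatile uint32_t status;   /* bit 0 set while the transmitter is busy */
    volatile uint32_t txdata;   /* writing here sends one character */
} uart_regs_t;

#define UART_STATUS_TX_BUSY 0x1u

/* Polled transmit: no interrupts, no buffers, no OS calls. */
static void crash_putc(uart_regs_t *uart, char c)
{
    while ((uart->status & UART_STATUS_TX_BUSY) != 0u) {
        /* spin until the transmitter is free; the watchdog bounds this wait */
    }
    uart->txdata = (uint32_t)(unsigned char)c;
}

/* Emit one 32-bit word as eight hex digits, most significant first. */
void crash_put_hex32(uart_regs_t *uart, uint32_t value)
{
    static const char digits[] = "0123456789ABCDEF";

    for (int shift = 28; shift >= 0; shift -= 4)
        crash_putc(uart, digits[(value >> shift) & 0xFu]);
}

/* Dump registers first (most important), then a slice of the stack,
 * one word per line. */
void crash_dump(uart_regs_t *uart,
                const uint32_t *regs, size_t nregs,
                const uint32_t *stack, size_t nwords)
{
    for (size_t i = 0; i < nregs; i++) {
        crash_put_hex32(uart, regs[i]);
        crash_putc(uart, '\n');
    }
    for (size_t i = 0; i < nwords; i++) {
        crash_put_hex32(uart, stack[i]);
        crash_putc(uart, '\n');
    }
}
```

If the watchdog fires partway through, the output is simply truncated, which is why the registers go out before the stack.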