Billion-mile software patch

Posted 2024-09-06 00:07:54 · 427 words · 4 views · 0 comments

Could someone here shed some light on how NASA goes about designing their spacecraft architecture to ensure that they are able to patch bugs in the deployed code?

I have never built any "real-time" systems, and this question came to mind after reading this article:

http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010

“One of the first major things we’ll
do when we wake the spacecraft up next
week will be uploading almost 20 minor
bug fixes and other code enhancements
to our fault protection (or “autopilot
response”) software.”

Comments (5)

醉城メ夜风 2024-09-13 00:07:54

I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VxWorks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).

One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
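To make the dispatch-vector idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any telephone switch or spacecraft actually does it: `PatchableModule`, `heading_error_v1`, and `heading_error_v2` are hypothetical names invented for the example. The point is that callers always go through the vector, the old code stays resident, and patching or un-patching is just a pointer swap.

```python
class PatchableModule:
    """Hypothetical module whose methods are reached only through a dispatch vector."""

    def __init__(self, **methods):
        self._dispatch = dict(methods)  # name -> currently active implementation
        self._history = {}              # name -> stack of replaced implementations

    def call(self, name, *args):
        # Every call looks up the vector, so a patch takes effect on the next call.
        return self._dispatch[name](*args)

    def patch(self, name, new_impl):
        # The old code stays resident so the patch can be backed out later.
        self._history.setdefault(name, []).append(self._dispatch[name])
        self._dispatch[name] = new_impl

    def unpatch(self, name):
        # Restore the most recently replaced implementation.
        self._dispatch[name] = self._history[name].pop()


def heading_error_v1(target_deg, actual_deg):
    # Buggy version: ignores wrap-around at 360 degrees.
    return target_deg - actual_deg


def heading_error_v2(target_deg, actual_deg):
    # Fixed version: shortest signed angle in [-180, 180).
    return (target_deg - actual_deg + 180) % 360 - 180


nav = PatchableModule(heading_error=heading_error_v1)
before = nav.call("heading_error", 10, 350)    # old code: -340
nav.patch("heading_error", heading_error_v2)
after = nav.call("heading_error", 10, 350)     # patched code: 20
nav.unpatch("heading_error")
restored = nav.call("heading_error", 10, 350)  # old code again: -340
```

Keeping a history stack per method is one simple way to make un-patching as cheap as patching.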

It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.

When I worked on telephone systems, we generally used patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daily work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.

Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).

Fault-tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply fail over to it.
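The split, reload, test, and fail-over sequence described above can be sketched roughly as follows. This is a toy model under my own assumptions: `Side`, `DuplexSystem`, `self_test`, and the `load_*` functions are hypothetical names, and a real duplex system would synchronize far more than a small dictionary of state.

```python
class Side:
    """One half of a hypothetical duplex (active/standby) system."""

    def __init__(self, name, software):
        self.name = name
        self.software = software  # a callable standing in for the software load
        self.state = {}

    def handle(self, request):
        return self.software(request)


class DuplexSystem:
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def handle(self, request):
        # Live traffic only ever reaches the active side.
        return self.active.handle(request)

    def reload_standby(self, new_software):
        # The standby is split off from traffic, so reloading it is safe.
        self.standby.software = new_software

    def fail_over(self, self_test):
        # Refuse to switch unless the isolated standby passes its checks.
        if not self_test(self.standby):
            return False
        self.standby.state = dict(self.active.state)  # transfer state between sides
        self.active, self.standby = self.standby, self.active
        return True


def load_v1(request):
    return f"v1:{request}"


def load_v2(request):
    return f"v2:{request}"


def load_bad(request):
    raise RuntimeError("faulty load")


def self_test(side):
    # Exercise the standby directly, without routing live traffic to it.
    try:
        return side.handle("ping").endswith(":ping")
    except Exception:
        return False


system = DuplexSystem(Side("A", load_v1), Side("B", load_v1))
system.active.state["calls_in_progress"] = 3

system.reload_standby(load_bad)
rejected = system.fail_over(self_test)  # faulty load never becomes active
system.reload_standby(load_v2)
accepted = system.fail_over(self_test)  # good load takes over, old side kept as fallback
```

After a successful fail-over the previous active side still holds the old load, so failing back is the same operation in the other direction.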

As with patching, every developer should know how to do failovers, and should do them both during development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacement that avoid required failovers whenever possible.

In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OS X, Linux, or any Unix variant to be sufficiently robust. Part of that is real-time requirements, but the whole issue of reliability and survivability is just as critical.

UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!

于我来说 2024-09-13 00:07:54

I've never built a real-time system either, but I suspect those systems would not have a memory protection mechanism. They do not need one, since they wrote all of the software themselves. Without memory protection, it is trivial for one program to write to another program's memory, and this can be used to hot-patch a running program (self-modifying code was a popular technique in the past; without memory protection, the same techniques can be used to modify another program's code).

Linux has been able to do minor kernel patching without rebooting for some time with Ksplice. This is necessary in situations where any downtime can be catastrophic. I've never used it myself, but I think the technique they use is basically this:

Ksplice can apply patches to the Linux
kernel without rebooting the computer.
Ksplice takes as input a unified diff
and the original kernel source code,
and it updates the running kernel in
memory. Using Ksplice does not require
any preparation before the system is
originally booted (the running kernel
does not need to have been specially
compiled, for example). In order to
generate an update, Ksplice must
determine what code within the kernel
has been changed by the source code
patch. Ksplice performs this analysis
at the ELF object code layer, rather
than at the C source code layer.

To apply a patch, Ksplice first
freezes execution of a computer so it
is the only program running. The
system verifies that no processors
were in the middle of executing
functions that will be modified by the
patch. Ksplice modifies the beginning
of changed functions so that they
instead point to new, updated versions
of those functions, and modifies data
and structures in memory that need to
be changed. Finally, Ksplice resumes
each processor running where it left
off.

(from Wikipedia)
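As a rough analogy for the freeze, verify, redirect, and resume steps in that description, here is a toy sketch in Python. It is not how Ksplice is implemented (Ksplice works on kernel object code and inspects real stacks); `LiveFunction` and its in-flight counter are hypothetical stand-ins I made up to show why the "nobody is inside the old function" check matters.

```python
import threading


class LiveFunction:
    """Hypothetical redirectable entry point, loosely like a jump placed at a function's start."""

    def __init__(self, impl):
        self._impl = impl
        self._in_flight = 0           # callers currently inside the function body
        self._lock = threading.Lock()

    def __call__(self, *args):
        with self._lock:
            impl = self._impl
            self._in_flight += 1
        try:
            return impl(*args)
        finally:
            with self._lock:
                self._in_flight -= 1

    def try_patch(self, new_impl):
        # Holding the lock stands in for "freeze execution": no call can enter or leave.
        with self._lock:
            if self._in_flight:
                return False          # someone is mid-execution, so refuse to patch
            self._impl = new_impl     # redirect the entry point to the new version
            return True


started = threading.Event()
release = threading.Event()


def slow_v1(x):
    started.set()
    release.wait(timeout=5)  # simulate a call that is still in progress
    return x + 1


def v2(x):
    return x + 2


f = LiveFunction(slow_v1)
results = []
worker = threading.Thread(target=lambda: results.append(f(1)))
worker.start()
started.wait(timeout=5)

busy = f.try_patch(v2)     # refused while the worker is inside slow_v1
release.set()
worker.join()
patched = f.try_patch(v2)  # accepted once no caller is inside the old code
```

The in-progress call finishes on the old code, and only later calls see the new version, which mirrors the "resume where it left off" step.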

一身软味 2024-09-13 00:07:54

Well I'm sure they have simulators to test with and mechanisms for hot-patching. Take a look at the linked article below - there's a pretty good overview of the spacecraft design. Section 5 discusses the computation machinery.

http://www.boulder.swri.edu/pkb/ssr/ssr-fountain.pdf

Of note:

  • Redundant processors
  • Command switching by the uplink card that does not require processor help
  • Time-lagged rules

请叫√我孤独 2024-09-13 00:07:54

I haven't worked on spacecraft, but the machines I've worked on have all been built to have a stable idle state where it's possible to shut down the machine briefly to patch the firmware. The systems that have accommodated 'live' updates are those that were broken into interacting components, where you can bring down one segment of the system long enough to update it and the other components can continue operating as normal, as they can tolerate the temporary downtime of the serviced component.

One way you can do this is to have parallel (redundant) capabilities, such as parallel machines that all perform the same task, so that the process can be routed around the machine under service. The benefit of this approach is that you can bring it down for longer periods for more significant service, such as regular hardware preventative maintenance. Once you have this capability, supporting downtime for a firmware patch is fairly easy.
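Routing work around a machine under service can be sketched like this. It is a simplified illustration with made-up names (`Machine`, `Pool`, `rolling_update`), assuming identical machines and a trivial round-robin router; it is not taken from any real system.

```python
class Machine:
    """Hypothetical machine in a pool of identical, redundant machines."""

    def __init__(self, name, firmware):
        self.name = name
        self.firmware = firmware
        self.in_service = True


class Pool:
    def __init__(self, machines):
        self.machines = machines
        self._next = 0

    def route(self, job):
        # Round-robin over the machines that are currently in service.
        available = [m for m in self.machines if m.in_service]
        if not available:
            raise RuntimeError("no machine in service")
        machine = available[self._next % len(available)]
        self._next += 1
        return f"{job} handled by {machine.name} (fw {machine.firmware})"

    def rolling_update(self, new_firmware, probe_job="probe"):
        log = []
        for machine in self.machines:
            machine.in_service = False         # take one machine out of rotation
            log.append(self.route(probe_job))  # work keeps flowing through the others
            machine.firmware = new_firmware    # patch it while it is idle
            machine.in_service = True          # return it to rotation
        return log


pool = Pool([Machine("m1", "1.0"), Machine("m2", "1.0"), Machine("m3", "1.0")])
log = pool.rolling_update("1.1")
```

Only one machine is ever out of rotation at a time, so the pool's capacity drops by one unit during the update rather than going to zero.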

自由如风 2024-09-13 00:07:54

One of the approaches that's been used in the past is to use LISP.
