Software patches from a billion miles away

Posted 2024-09-06 00:07:54, 427 characters, 4 views, 0 comments

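Can anyone here shed some light on how NASA designs its spacecraft architecture to ensure that bugs in already-deployed code can be patched?

I have never built any "real-time" type of system, and this question came to mind after reading this article: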

http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010

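"One of the first big things we'll do when we wake the spacecraft up next week will be to upload almost 20 minor bug fixes and other code enhancements to our fault protection (or "autopilot response") software."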

醉城メ夜风 2024-09-13 00:07:54

I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VxWorks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).

One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
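To make the dispatch-vector idea concrete, here is a minimal sketch in Python. Everything in it (the `Module` class, the `checksum` slot, both checksum functions) is hypothetical and made up for illustration; real switching or flight software would do this at the level of the loader and the operating system, not in application code.

```python
from typing import Callable, Dict, Optional


class Module:
    """Hypothetical module whose methods are reached only via a dispatch vector."""

    def __init__(self) -> None:
        self._dispatch: Dict[str, Callable[..., object]] = {}

    def install(self, name: str, impl: Callable[..., object]) -> Optional[Callable[..., object]]:
        """Point the named slot at impl and return the previous implementation, if any."""
        previous = self._dispatch.get(name)
        self._dispatch[name] = impl
        return previous

    def call(self, name: str, *args: object) -> object:
        """Every caller goes through the vector, so a patch takes effect on the next call."""
        return self._dispatch[name](*args)


def checksum_v1(data: bytes) -> int:
    # Deliberately buggy stand-in: the sum is never reduced to one byte.
    return sum(data)


def checksum_v2(data: bytes) -> int:
    # Replacement code; it is loaded alongside the old version.
    return sum(data) % 256


telemetry = Module()
telemetry.install("checksum", checksum_v1)
before = telemetry.call("checksum", b"\xff\xff")

# The "patch" is a single update of the dispatch vector.
old_impl = telemetry.install("checksum", checksum_v2)
after = telemetry.call("checksum", b"\xff\xff")
```

Both versions stay resident, as described above: `old_impl` still refers to `checksum_v1`, so the old code remains available if the patch has to be backed out.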

It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.
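As a rough illustration of what "patching and un-patching" bookkeeping might involve, the sketch below records each applied patch so it can be reverted. The `PatchManager` class, the patch ID `P-001`, and the routing functions are all assumptions invented for this example; it runs in user space only to show the idea, whereas the point above is that the real facility belongs in the operating system.

```python
from typing import Callable, Dict, List, Tuple


class PatchManager:
    """Hypothetical patch bookkeeping: applies patches and can revert the latest one."""

    def __init__(self, dispatch: Dict[str, Callable[..., object]]) -> None:
        self._dispatch = dispatch
        # Stack of (patch_id, slot, previous implementation).
        self._applied: List[Tuple[str, str, Callable[..., object]]] = []

    def apply(self, patch_id: str, slot: str, new_impl: Callable[..., object]) -> None:
        if slot not in self._dispatch:
            raise KeyError(f"unknown slot: {slot}")
        self._applied.append((patch_id, slot, self._dispatch[slot]))
        self._dispatch[slot] = new_impl

    def unpatch(self, patch_id: str) -> None:
        # Simplifying assumption: patches are reverted in reverse order of application.
        if not self._applied or self._applied[-1][0] != patch_id:
            raise ValueError(f"{patch_id} is not the most recently applied patch")
        _, slot, previous = self._applied.pop()
        self._dispatch[slot] = previous

    def applied_ids(self) -> List[str]:
        return [entry[0] for entry in self._applied]


def route_v1(digits: str) -> str:
    return "trunk-A"


def route_v2(digits: str) -> str:
    return "trunk-B" if digits.startswith("9") else "trunk-A"


table: Dict[str, Callable[..., object]] = {"route": route_v1}
manager = PatchManager(table)

manager.apply("P-001", "route", route_v2)
patched = table["route"]("911")
ids_while_patched = manager.applied_ids()

manager.unpatch("P-001")
reverted = table["route"]("911")
```

Restricting un-patching to the most recent patch keeps the sketch short; a real system would also need to track dependencies between patches.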

When I worked on telephone systems, we generally used patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daily work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.

Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).

Fault-tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply fail over to it.
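The split, reload, test, and fail-over sequence can be sketched as a toy model. The `Side` and `DuplexSystem` classes, the version strings, and the `scale_*` handlers below are hypothetical names chosen for illustration; this is not how any particular switch or spacecraft implements it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Side:
    name: str
    load_version: str
    handler: Callable[[int], int]
    state: Dict[str, int]


class DuplexSystem:
    """Hypothetical two-sided system: one active side, one standby side."""

    def __init__(self, active: Side, standby: Side) -> None:
        self.active = active
        self.standby = standby
        self.synchronized = True

    def split(self) -> None:
        """Stop running the sides in lockstep so the standby can be changed."""
        self.synchronized = False

    def reload_standby(self, version: str, handler: Callable[[int], int]) -> None:
        if self.synchronized:
            raise RuntimeError("split the system before reloading the standby side")
        self.standby.load_version = version
        self.standby.handler = handler

    def test_standby(self, cases: Dict[int, int]) -> bool:
        """Exercise the new load on the standby side only; the active side is untouched."""
        return all(self.standby.handler(arg) == want for arg, want in cases.items())

    def failover(self) -> None:
        # Transfer live state to the other side, then swap roles.
        self.standby.state = dict(self.active.state)
        self.active, self.standby = self.standby, self.active


def scale_v1(x: int) -> int:
    return x * 2


def scale_v2(x: int) -> int:
    return x * 2 + 1


system = DuplexSystem(
    active=Side("A", "1.0", scale_v1, {"calls": 7}),
    standby=Side("B", "1.0", scale_v1, {"calls": 7}),
)
system.split()
system.reload_standby("1.1", scale_v2)
if system.test_standby({0: 1, 3: 7}):
    system.failover()
```

After the fail-over, side A still carries the old load, so failing back to it is the natural way to back out a bad update.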

As with patching, every developer should know how to do failovers, and should do them during both development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacements that avoid required failovers whenever possible.

In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OS X, Linux, or any Unix variant to be sufficiently robust. Part of that is real-time requirements, but the whole issue of reliability and survivability is just as critical.

UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!

于我来说 2024-09-13 00:07:54

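I've never built a real-time system either, but I suspect those systems don't have a memory protection mechanism. They don't need one, since they wrote all of the software themselves. Without memory protection, it is trivial for one program to write to another program's memory, and that can be used to hot-patch a running program. (Writing self-modifying code was a popular technique in the past; without memory protection, the same techniques can be used to modify another program's code.)

Linux has been able to apply minor kernel patches without rebooting for some time, using Ksplice. This is necessary in situations where any downtime could be catastrophic. I've never used it myself, but I think the technique it uses is basically this: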

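Ksplice can apply patches to the Linux kernel without rebooting the computer. Ksplice takes as input a unified diff and the original kernel source code, and it updates the running kernel in memory. Using Ksplice does not require any preparation before the system is originally booted (the running kernel does not need to have been specially compiled, for example). In order to generate an update, Ksplice must determine what code within the kernel has been changed by the source code patch. Ksplice performs this analysis at the ELF object code layer, rather than at the C source code layer.

To apply a patch, Ksplice first freezes execution of a computer so it is the only program running. The system verifies that no processors were in the middle of executing functions that will be modified by the patch. Ksplice modifies the beginning of changed functions so that they instead point to new, updated versions of those functions, and modifies data and structures in memory that need to be changed. Finally, Ksplice resumes each processor running where it left off.

(From Wikipedia)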
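The apply steps in that description (freeze, check that no processor is inside a function about to change, redirect the function entry, resume) can be modelled with a toy simulation. This is only a sketch of the sequence as quoted; `ToyKernel`, `cpu_stacks`, and the `parse_*` functions are hypothetical, and nothing here reflects Ksplice's real interface or internals.

```python
from typing import Callable, Dict, List


class ToyKernel:
    """Toy model of a running kernel: named functions plus per-CPU call stacks."""

    def __init__(self, functions: Dict[str, Callable[..., object]],
                 cpu_stacks: List[List[str]]) -> None:
        self.functions = dict(functions)
        # Stand-in for the jump written at the start of each changed function.
        self.redirects: Dict[str, Callable[..., object]] = {}
        # Names of the functions each CPU is currently executing.
        self.cpu_stacks = cpu_stacks
        self.frozen = False

    def call(self, name: str, *args: object) -> object:
        if self.frozen:
            raise RuntimeError("execution is frozen")
        target = self.redirects.get(name, self.functions[name])
        return target(*args)

    def apply_update(self, replacements: Dict[str, Callable[..., object]]) -> bool:
        self.frozen = True  # 1. freeze everything else
        try:
            # 2. refuse if any CPU is in the middle of a function to be replaced
            busy = {fn for stack in self.cpu_stacks for fn in stack}
            if busy.intersection(replacements):
                return False
            # 3. redirect the old entry points to the new versions
            self.redirects.update(replacements)
            return True
        finally:
            self.frozen = False  # 4. resume where things left off


def parse_old(n: int) -> int:
    return n - 1  # stand-in for a function with a bug (goes negative)


def parse_new(n: int) -> int:
    return max(n - 1, 0)


kernel = ToyKernel({"parse": parse_old, "idle": lambda: None},
                   cpu_stacks=[["idle"], ["parse"]])

first_try = kernel.apply_update({"parse": parse_new})   # CPU 1 is inside parse
kernel.cpu_stacks[1] = ["idle"]                         # later, parse has returned
second_try = kernel.apply_update({"parse": parse_new})
result = kernel.call("parse", 0)
```

In this model a refused update is simply retried later, once no CPU is inside the affected function.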