
programming-languages language-design

Does a disaster-proof language exist?

Posted on 2024-08-04 12:44:32


When creating system services which must have high reliability, I often end up writing a lot of 'failsafe' mechanisms for cases such as: communications which are gone (for instance communication with the DB), or what happens if the power is lost and the service restarts... how to pick up the pieces and continue in a correct way (remembering that the power could go out again while picking up the pieces...), etc.

I can imagine that, for not too complex systems, a language which catered for this would be very practical. That is, a language which remembers its state at any given moment, no matter if the power gets cut off, and continues where it left off.

Does this exist yet? If so, where can I find it? If not, why can't this be realized? It would seem to me very handy for critical systems.

P.S. In case the DB connection is lost, it would signal that a problem arose and that manual intervention is needed. The moment the connection is restored, it would continue where it left off.

EDIT:
Since the discussion seems to have died off, let me add a few points (while waiting before I can add a bounty to the question).

The Erlang response seems to be top rated right now. I'm aware of Erlang and have read the Pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the 'fault tolerant' bit doesn't come automatically. Far from it. Erlang offers a lot of supervisors and other methodologies to supervise a process and restart it if necessary. However, to properly make something which works with these structures, you need to be quite the Erlang guru, and you need to make your software fit all these frameworks. Also, if the power drops, the programmer still has to pick up the pieces and try to recover the next time the program restarts.

What I'm searching for is something far simpler:

Imagine a language (as simple as PHP, for instance) where you can do things like run DB queries, act on the results, perform file manipulations, perform folder manipulations, etc.

Its main feature, however, should be: if the power dies and the thing restarts, it takes off where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume that too, and so on.
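For the file-copy case specifically, something close to this can be approximated in an ordinary language today. The following is a minimal Python sketch, not a definitive implementation. It assumes the source file does not change between runs, and the `.part` suffix and the function name `resumable_copy` are hypothetical choices made for illustration.

```python
import os

def resumable_copy(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, resuming from a partial file left by an earlier run.

    Returns the byte offset at which this run resumed (0 for a fresh copy).
    Assumption: src is unchanged since the interrupted run.
    """
    part = dst + ".part"  # hypothetical name for the in-progress file
    done = os.path.getsize(part) if os.path.exists(part) else 0
    with open(src, "rb") as fin, open(part, "ab") as fout:
        fin.seek(done)  # skip what an earlier run already copied
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()
            os.fsync(fout.fileno())  # slow, but the bytes survive a power cut
    os.replace(part, dst)  # atomic rename when both paths share a filesystem
    return done
```

The destination only appears under its final name once the copy is complete, so a reader never sees a half-written file. What the sketch cannot do is decide whether resuming is still the right thing after the outage; that stays with the programmer.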

Last but not least, if the DB connection drops and can't be restored, the language just halts, and signals (syslog perhaps) for human intervention, and then carries on where it left off.

A language like this would make a lot of services programming a lot easier.

EDIT:
It seems (judging by all the comments and answers) that such a system doesn't exist. And probably will not in the near foreseeable future due to it being (near?) impossible to get right.

Too bad... Again, I'm not looking for this language (or framework) to get me to the moon, or to monitor someone's heart rate. But for small periodic services/tasks, which always end up having loads of code handling border cases (a power failure somewhere in the middle, connections dropping and not coming back up), an approach of pause here, fix the issues, and continue where you left off would work well.

(or a checkpoint approach as one of the commenters pointed out (like in a videogame). Set a checkpoint.... and if the program dies, restart here the next time.)

Bounty awarded:
At the last possible minute, when everyone was coming to the conclusion that it can't be done, Stephen C came up with Napier88, which seems to have the attributes I was looking for.
Although it is an experimental language, it does prove it can be done, and it is something which is worth investigating more.

I'll be looking at creating my own framework (with persistent state and snapshots perhaps) to add the features I'm looking for in .Net or another VM.

Thanks, everyone, for the input and the great insights.


梦明 2024-08-11 12:44:33

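If I were to solve your problem, I would write a daemon (probably in C) that does all database interaction inside transactions, so that no bad data gets inserted if it is interrupted. Then I would have the system start this daemon at boot.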
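The transactional part of that idea can be sketched briefly. The example below uses Python with the standard `sqlite3` module as a stand-in for the daemon and its database; the `accounts` table, the `transfer` function and the `fail_midway` switch are hypothetical and exist only to simulate an interruption between two writes.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move amount between two rows; either both updates land or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        if fail_midway:
            # Stand-in for the process dying between the two writes.
            raise RuntimeError("simulated interruption")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()
```

If `transfer` is interrupted, the first update is rolled back and the table is left exactly as it was, which is the "no bad data" property described above. With an on-disk database, a real power cut is handled the same way by the database's own journal on the next open.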

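Obviously, developing web stuff in C is much slower than in a scripting language, but it will perform better and be more stable (if you write good code, of course :).

Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron, or whatever scheduler) run it every so often, because I wouldn't need something that updates every clock cycle.

Hope that makes sense.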


策马西风 2024-08-11 12:44:33

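In my opinion, the concept of failure recovery is, most of the time, a business problem, not a hardware or language problem.

Take an example: you have a UI tier and a subsystem.
The subsystem is not very reliable, but the client on the UI tier should perceive it as if it were.

Now imagine that your subsystem somehow crashes. Do you really think the language you imagine can work out for you how to handle the UI tier that depends on this subsystem?

Your user should be explicitly aware that the subsystem is unreliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response that may eventually come two weeks later). And if he has to be aware of this, any abstraction that hides it will eventually leak.

By client I mean the end user. The UI should reflect this unreliability rather than hide it; a computer cannot think for you in that case.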


迷乱花海 2024-08-11 12:44:33

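"So a language which would remember its state at any given moment, no matter if the power gets cut off, and continues where it left off."

"Continuing where it left off" is often not the correct recovery strategy. No language or environment in the world will attempt to guess how to recover from a particular fault automatically. The best it can do is give you tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.

Exception handling (to fail fast and still ensure consistency of state)
Transactions (to roll back incomplete changes)
Workflows (to define recovery routines that are called automatically)
Logging (for tracking down the cause of a fault)
AOP/dependency injection (to avoid having to manually insert code to do all of the above)

These are very generic tools and are available in lots of languages and environments.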


画骨成沙 2024-08-11 12:44:32

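Erlang was designed for use in telecommunication systems, where high reliability is fundamental. I think they have a standard methodology for building sets of communicating processes in which failures can be handled gracefully.

ERLANG is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an ERLANG application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for their restart.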


风轻花落早 2024-08-11 12:44:32


Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.

STM is a technique for implementing "transactions", i.e., sets of actions that are done effectively as an atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way which is easier to understand than traditional lock-that-resource programming, and which arguably has lower overhead by virtue of a highly optimistic, lock-free style of programming.

The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.

If one insisted that all computations were transactions, and the state at the beginning(/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power fail could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see minicomputer story at end). Alternatively, committed state changes could be written to log files that were written to disk; this is the standard method used by most databases and by reliable filesystems.
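The log-file alternative mentioned at the end of that paragraph can be sketched in a few lines. This is a simplified illustration in Python, not how any particular database does it: each committed transaction is appended to a log and forced to disk, and on restart the state is rebuilt by replaying the log, with a torn final record treated as a transaction that never committed. The class name `JournaledState` and the one-JSON-object-per-line format are assumptions made for the example.

```python
import json
import os

class JournaledState:
    """Key/value state rebuilt from an append-only log of committed changes."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if not os.path.exists(path):
            return
        good = 0  # byte offset just past the last intact record
        with open(path, "rb") as log:
            for raw in log:
                try:
                    if not raw.endswith(b"\n"):
                        raise ValueError("incomplete record")
                    change = json.loads(raw)
                except ValueError:
                    break  # torn write: this transaction counts as rolled back
                self.state.update(change)
                good += len(raw)
        with open(path, "r+b") as log:
            log.truncate(good)  # drop the torn tail so later appends stay valid

    def commit(self, changes):
        """Durably append one transaction (a dict), then apply it in memory."""
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(changes) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.state.update(changes)
```

A change only becomes visible in `state` after its record has been forced to disk, so a restart never observes a transaction that was not fully written. The `fsync` on every commit is also where the performance cost discussed in this thread comes from.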

The current question with STM is, how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.

People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see Communications of ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect-free property of functional programs makes STM relatively trivial to implement in functional languages.

If I recall correctly, back in the 70s there was considerable early work in distributed operating systems, in which processes (code+state) could travel trivially from machine to machine. I believe several such systems explicitly allowed node failure, and could restart a process in a failed node from saved state in another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages back in the 70s was popular, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow node failure and restart, I'm fairly sure the follow-on research systems did.

EDIT: A 1996 system which appears on first glance to have the properties you desire is documented here. Its concept of atomic transactions is consistent with the ideas behind STM. (Goes to prove there isn't a lot new under the sun.)

A side note: back in the 70s, core memory was still king. Core, being magnetic, was nonvolatile across power fails, and many minicomputers (and I'm sure the mainframes) had power-fail interrupts that notified the software some milliseconds ahead of loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and reliably restart. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, and come back up and restart all the teletypes as if nothing happened. The change from cacophony to silence and back was stunning. I know; I had to repeat it many times to debug the power-failure management code, and it of course made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was of course Assembler :-}


寂寞美少年 2024-08-11 12:44:32

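I doubt that the language features you are describing are possible to achieve.

The reason is that it would be very hard to define common and general failure modes and how to recover from them. Think for a second about your sample application: some website with some logic and database access. Say we have a language that can detect a power shutdown and the subsequent restart, and somehow recover from it. The problem is that the language cannot possibly know how to recover.

Say your app is an online blog application. In that case it might be enough to just continue from the point where we failed, and all will be fine. However, consider a similar scenario for an online bank. Suddenly it is no longer smart to continue from the same point. For example, if I was trying to withdraw some money from my account and the computer died right after the checks but before it performed the withdrawal, and it then comes back a week later, it would carry on and pay out even though my account is now in the negative.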
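The bank scenario shows why the check and the action have to be a single atomic step, so that a resumed operation re-validates rather than trusting a week-old check. A minimal sketch in Python with `sqlite3` follows; the `accounts` table and the `withdraw` function are hypothetical, and this is application-level logic of exactly the kind a language could not guess on its own.

```python
import sqlite3

def withdraw(conn, account, amount):
    """Debit only if the balance still covers it.

    The balance check and the debit are one UPDATE statement, so there is
    no window between "checked" and "withdrawn" in which to crash.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE name = ? AND balance >= ?",
            (amount, account, amount),
        )
    return cur.rowcount == 1  # False means the precondition no longer holds
```

Retrying a withdrawal after a restart then re-runs the check against the current balance instead of resuming "after the check".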

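In other words, there is no single correct recovery strategy, so this is not something that can be built into a language. What a language can do is tell you when something bad happens, but most languages already support that through exception-handling mechanisms. The rest is up to the application designers to think about.

There are a lot of technologies for designing fault-tolerant applications: database transactions, durable message queues, clustering, hardware hot swapping and so on. But it all depends on the concrete requirements and on how much the end user is willing to pay for it.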


冰雪梦之恋 2024-08-11 12:44:32

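As far as I know, Ada is often used in safety-critical (failsafe) systems.

Ada was originally targeted at embedded and real-time systems.

Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.

Ada supports run-time checks in order to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.

For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons), and spacecraft.

N-version programming may also give you some helpful background reading.

1. "As far as I know" here basically means one acquaintance who writes embedded safety-critical software.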


少女的英雄梦 2024-08-11 12:44:32

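There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports orthogonal persistence, and in some implementations this extends (extended) to include the state of the entire computation. Specifically, when the Napier88 runtime system checkpointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.

Unfortunately, there are a number of hard problems that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.

And there is the problem of doing orthogonal persistence in a mainstream language. There have been attempts to do OP in Java, including one done by people associated with Sun (the PJama project), but there is nothing active at the moment. The JDO/Hibernate approaches are more favoured these days.

I should point out that orthogonal persistence isn't really disaster-proof in the broad sense. For instance, it cannot deal with:

reestablishment of connections, etc. with "external" systems after a restart,
application bugs that cause corruption of persisted data, or
loss of data due to something bringing down the system between checkpoints.

For those, I don't believe there are general solutions that would be practical.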


不语却知心 2024-08-11 12:44:32

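Most such efforts, termed 'fault tolerance', are around the hardware, not the software.

The extreme example of this is Tandem, whose 'NonStop' machines have complete redundancy.

Implementing fault tolerance at the hardware level is attractive because a software stack is typically made from components sourced from different providers: your high-availability software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is flaky, using hardware device drivers that are decidedly fragile.

But at a language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely tested correctly, rarely tested together in multi-failure scenarios, and it is usually in the error-handling code that the bugs hide. So it is more about programmer understanding, discipline and trade-offs than about the languages themselves.

Which brings us back to fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error-handling code in the applications.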


奶茶白久 2024-08-11 12:44:32

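No, a disaster-proof language does not exist.

Edit:

Disaster-proof implies perfection. It brings to mind images of a process which applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no way a programming language can do this. If you, as the programmer, cannot figure out how your program is going to fail and how to recover from it, then your program isn't going to be able to do so either.

Disaster, from an IT perspective, can arise in so many fashions that no one process can resolve all those different issues. The idea that you could design a language to address all the ways in which something could go wrong is simply incorrect. Because of the abstraction from the hardware, many problems don't even make much sense to address with a programming language; yet they are still 'disasters'.

Of course, once you start limiting the scope of the problem, we can begin talking about developing a solution to it. So when we stop talking about being disaster-proof and start speaking about recovering from unexpected power surges, it becomes much easier to develop a programming language to address that concern, even when it perhaps doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction that once you scope this down to realistic implementations, it becomes uninteresting as a language, since it has become so specific. I.e., use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance); this is not a compelling business case to my mind.

Please don't misunderstand me. There are some excellent suggestions in this thread, but to my mind they do not rise to anything even remotely approaching disaster-proof.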


恋你朝朝暮暮 2024-08-11 12:44:32


Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume at the point it left when it restarts. Therefore, your program is 'disaster proof' to the extent that it can survive a power failure.

This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, and 'fault tolerance' etc. Curious nobody mentioned 'memristors', as they would offer a future architecture with these properties and one that is perhaps not completely von Neumann architecture too.

Now imagine a system built from two such discrete systems - for a straightforward illustration, one is a database server and the other an application server for an online banking website.

Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?

It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not check-pointed but the languages try and detect problems and ask the programmer to deal with them.

It could pause too - at the hardware level they could be tied together, such that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.

Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and they could at that point be too old! Only application logic can really work out what to do in those circumstances, and there we are back to languages delegating to the programmer again.
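The "too old" problem is easy to show concretely. In the Python sketch below, each message carries its creation time and a freshness limit, and the consumer diverts stale messages instead of acting on them. The `Message` fields, the `drain` function and the `on_stale` callback are hypothetical names, and the queue is a plain in-memory list standing in for a persistent one.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Message:
    body: str
    max_age_s: float  # hypothetical per-message freshness limit, in seconds
    created_at: float = field(default_factory=time.time)

def drain(queue, handle, on_stale, now=None):
    """Process queued messages, diverting those older than their limit."""
    now = time.time() if now is None else now
    for msg in queue:
        if now - msg.created_at > msg.max_age_s:
            on_stale(msg)  # what to do here is pure application logic
        else:
            handle(msg)
```

The mechanism is trivial; the hard part is choosing `max_age_s` for each kind of message and deciding what `on_stale` should do, which is exactly the decision that falls back on the programmer.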

So it seems that disaster-proofing is better in its current form: uninterruptible power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!


墨落画卷 2024-08-11 12:44:32

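Precise answer:

Ada and SPARK were designed for maximum fault tolerance and to move all possible bugs to compile time rather than runtime. Ada was designed by the US Department of Defense for military and aviation systems, running on embedded devices in such things as airplanes. SPARK is its descendant. There is another language, HAL/S, used in the early US space program, geared toward handling hardware failure and memory corruption due to cosmic rays.

Practical answer:

I've never met anyone who can code Ada/SPARK. For most users the best answer is SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional safety, is Turing-complete, and is pretty tolerant of problems.

Reason there isn't a better answer:

For performance reasons, you can't provide durability for every program operation. If you did, processing would slow to the speed of your fastest non-volatile storage. At best, your performance will drop by a thousand- or million-fold, because of how much slower ANYTHING is than CPU caches or RAM.

It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU: at most you could do a couple of hundred operations per second. Except this would be even slower.

In cases where frequent power cycling or hardware failures exist, you use something like a DBMS, which guarantees ACID for every important operation. Or you use hardware that has fast, non-volatile storage (flash, for example); this is still much slower, but if the processing is simple, that is OK.

At best your language gives you good compile-time safety checks for bugs and will throw exceptions rather than crash. Exception handling is a feature of half the languages in use now.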


憧憬巴黎街头的黎明 2024-08-11 12:44:32


There are several commercially available frameworks (Veritas, Sun's HA, IBM's HACMP, etc.) which will automatically monitor processes and start them on another server in the event of failure.

There is also expensive hardware, like HP's Tandem NonStop range, which can survive internal hardware failures.

However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program which allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:

     IEFBR14 START
             BR    14       Return addr in R14 -- branch at it
             END

Nothing could be simpler? During its long life this program has actually accumulated a bug report and is now on version 4.

That's 1 bug to three lines of code; the current version is four times the size of the original.

Errors will always creep in, just make sure you can recover from them.


苦妄 2024-08-11 12:44:32


This question forced me to post this text.

(It's quoted from HGTTG by Douglas Adams:)

Click, hum.

The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.

On board the ship, everything was as it had been for millennia, deeply dark and silent.

Click, hum.

At least, almost everything.

Click, click, hum.

Click, hum, click, hum, click, hum.

Click, click, click, click, click, hum.

Hmmm.

A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.

The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.

The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem.

The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.

It couldn't find the look-up table.

Odd.

It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.

The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.

Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.

This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.

Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.

This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.

This provided the first major clue as to what it was that was wrong.

Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.

The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.

The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.

It relaxed.

Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.

It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.

"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."

All of the rest was complete garbage.

Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.

It must also revive all of its crew.

There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.

Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.

The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.


浅笑轻吟梦一曲 2024-08-11 12:44:32

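Try taking an existing open-source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to 'handle' concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.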


贱人配狗天长地久 2024-08-11 12:44:32

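For a program to continue where it left off if the machine loses power, not only would it need to save state somewhere, the OS would also have to 'know' to resume it.

I suppose implementing a 'hibernate' feature in a language could be done, but having that happen constantly in the background, so it's ready in the event anything bad happens, sounds like the OS's job, in my opinion.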


信愁 2024-08-11 12:44:32

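"Its main feature however should be: if the power dies and the thing restarts, it takes off where it left off (so it not only remembers where it was, it will remember the variable states as well). Also, if it stopped in the middle of a file copy, it will also properly resume. Etc., etc."

... ...

"I've looked at Erlang in the past. However nice its fault-tolerant features are... it won't survive a power cut. When the code restarts you'll have to pick up the pieces."

If such a technology existed, I'd be VERY interested in reading about it. That said, the Erlang solution would be to have multiple nodes, ideally in different locations, so that if one location went down the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck, as you mentioned in a follow-up comment.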


或十年 2024-08-11 12:44:32

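The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.

What is Concurrency and Coordination Runtime (CCR)?

Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message-passing with powerful orchestration primitives enabling coordination of data and work without the use of manual threading, locks, semaphores, etc. CCR addresses the need of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.

What is Decentralized Software Services (DSS)?

Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event notification architecture enabling a system-level approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation, and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.

Most of the answers given are general-purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want, and/or expect, a robot to do when it recovers from a power failure?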


策马西风 2024-08-11 12:44:32

In the embedded world, this can be implemented with a watchdog interrupt and battery-backed RAM. I've written such a thing myself.


旧竹 2024-08-11 12:44:32

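Depending upon your definition of a disaster, delegating this responsibility to the language can range from 'difficult' to 'practically impossible'.

Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works so long as the computer doesn't get destroyed.

How would a language-level feature know to restart the application on a new host?

And in the situation of restoring the application to a host: what if a significant amount of time has passed, and assumptions/checks made previously are now invalid?

T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster-proof': they either succeed (and the data is saved), or they don't. Excluding disabling transaction isolation, it's difficult (but probably not impossible if you really try hard) to get into 'unknown' states.

You can use techniques like SQL mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.

You still need to ensure you save your state every time it's safe (commit).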


此刻的回忆 2024-08-11 12:44:32

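If I understand your question correctly, I think you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).

If this is correct, then I would refer you to the halting problem:

Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.

I think classifying your question as an instance of the halting problem is fair, considering that you would ideally like the language to be 'disaster-proof', that is, to impart a 'perfectness' to any flawed program or chaotic environment.

This classification reduces any combination of environment, language and program down to 'program and a finite input'.

If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no 'disaster-proof' language or compiler or environment could be proven to be so.

However, it is entirely reasonable to design a language that provides recovery options for various common problems.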


请止步禁区 2024-08-11 12:44:32

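In the case of power failure... this sounds to me like: 'When your only tool is a hammer, every problem looks like a nail.'

You don't solve power-failure problems within a program. You solve this problem with backup power supplies, batteries, etc.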


我不吻晚风 2024-08-11 12:44:32

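If the mode of failure is limited to hardware failure, VMware Fault Tolerance claims something similar to what you want. It runs a pair of virtual machines across multiple clusters and, using what they call vLockstep, the primary VM sends all state to the secondary VM in real time, so in case of primary failure, execution transparently flips to the secondary.

My guess is that this wouldn't help with communication failure, which is more common than hardware failure. For serious high availability, you should consider distributed systems such as Birman's process group approach (paper in PDF format, or the book Reliable Distributed Systems: Technologies, Web Services, and Applications).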


婴鹅 2024-08-11 12:44:32

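The closest approximation appears to be SQL. It's not really a language issue, though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.

A quick and dirty approximation is achieved by application checkpointing. You lose the 'die at any moment' property, but it's pretty close.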
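Application checkpointing of this kind can be sketched in a few lines. The Python example below is one possible shape, under stated assumptions: the state is small and JSON-serialisable, the checkpoint lives in a single file, and the `crash_after` parameter exists only to simulate a failure. All function names are hypothetical.

```python
import json
import os

def save_checkpoint(path, state):
    """Write state to a temp file, fsync it, then atomically swap it in."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see the old or the new file, never half

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default

def run(items, path, process, crash_after=None):
    """Process items in order, recording the index of the next item to do."""
    state = load_checkpoint(path, {"next": 0})
    for i in range(state["next"], len(items)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated power loss")
        process(items[i])
        save_checkpoint(path, {"next": i + 1})
```

A crash between `process` and `save_checkpoint` means that item is processed again on restart, so `process` needs to be idempotent. That gap is the price of giving up the 'die at any moment' property.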


箹锭⒈辈孓 2024-08-11 12:44:32

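I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution that is intolerant of internal faults.

If it were me, I would invest in reliable hardware AND design the software in such a way that it is able to recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to reconnect manually, you are likely using the wrong API.

As others have pointed out, procedural languages embedded in modern RDBMS systems are the best you are going to get without the use of an exotic language.

VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing within your application as appropriate.

VMware in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point-in-time playback. Obviously there is a massive performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.

You would most likely be able to find similar solutions for Java bytecode running inside a Java virtual machine. Google "fault-tolerant JVM" and "virtual machine checkpointing".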


司马昭之心 2024-08-11 12:44:32

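If you do want the program information saved, where would you save it?

It would need to be saved, e.g., to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.

You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and to save state to the database before each atomic operation. Then you can recover to the point of that atomic operation.

I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore there is a trade-off between level of granularity and efficiency, which would be hard to define in an arbitrary application.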


┾廆蒐ゝ 2024-08-11 12:44:32

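First, implement a fault-tolerant application. One where, if you have 8 features and 5 failure modes, you have done the analysis and testing to demonstrate that all 40 combinations work as intended (and as desired by the specific customer: no two will likely agree).
Second, add a scripting language on top of the supported set of fault-tolerant features. It needs to be as near to stateless as possible, so it will almost certainly be something non-Turing-complete.
Finally, work out how to handle restoration and repair of the scripting-language state, adapted to each failure mode.

And yes, this is pretty much rocket science.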