When creating system services that must be highly reliable, I often end up writing a lot of 'failsafe' mechanisms for cases such as: lost communications (for instance with the DB), or a power loss followed by a service restart... how to pick up the pieces and continue correctly (remembering that the power could go out again while picking up the pieces...), and so on.
I can imagine that, for systems that are not too complex, a language that caters for this would be very practical: a language that remembers its state at any given moment, even if the power is cut, and continues where it left off.
Does this exist yet? If so, where can I find it? If not, why can't this be realized? It would seem to me very handy for critical systems.
P.S. If the DB connection is lost, it would signal that a problem arose and that manual intervention is needed. The moment the connection is restored, it would continue where it left off.
EDIT:
Since the discussion seems to have died down, let me add a few points (while waiting until I can add a bounty to the question).
The Erlang answer seems to be the top-rated one right now. I'm aware of Erlang and have read the Pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the 'fault tolerant' bit doesn't come automatically. Far from it. Erlang offers supervisors and other mechanisms to supervise a process and restart it if necessary. However, to properly build something that works with these structures, you need to be quite the Erlang guru, and you need to make your software fit all these frameworks. Also, if the power drops, the programmer still has to pick up the pieces and try to recover the next time the program restarts.
What I'm looking for is something far simpler:
Imagine a language (as simple as PHP, for instance) in which you can run DB queries, act on the results, manipulate files and folders, etc.
Its main feature, however, should be: if the power dies and the machine restarts, it picks up where it left off (so it remembers not only where it was but also the state of its variables). Also, if it stopped in the middle of a file copy, it will properly resume that too, and so on.
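The "resume in the middle of a file copy" behaviour can be approximated in application code. The following is only a minimal sketch in Python, assuming the source file does not change between attempts and that whatever bytes already sit in the partial file are valid; `resumable_copy` and the `.part` suffix are names invented for this illustration.

```python
import os


def resumable_copy(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, continuing from a leftover partial file if one exists."""
    part = dst + ".part"  # hypothetical naming convention for the partial copy
    # Assumption: bytes already in the partial file are a correct prefix of src.
    done = os.path.getsize(part) if os.path.exists(part) else 0
    with open(src, "rb") as fin, open(part, "ab") as fout:
        fin.seek(done)  # skip what an earlier, interrupted run already copied
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # push the data to stable storage before renaming
    # The rename is atomic, so dst only ever appears as a complete file.
    os.replace(part, dst)
```

Calling `resumable_copy` again after an interruption continues from the size of the `.part` file instead of starting over.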
Last but not least, if the DB connection drops and can't be restored, the language just halts, and signals (syslog perhaps) for human intervention, and then carries on where it left off.
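The "halt, signal, and carry on" behaviour could look roughly like the sketch below. It is an illustration under assumptions, not a feature of any existing language: `run_with_intervention` is a made-up name, and the built-in `ConnectionError` stands in for whatever exception a real DB driver raises.

```python
import logging
import time

log = logging.getLogger("service")


def run_with_intervention(operation, retry_seconds=30.0, sleep=time.sleep):
    """Run operation(); while the connection is down, alert once and keep waiting."""
    alerted = False
    while True:
        try:
            result = operation()
        except ConnectionError as exc:  # stand-in for a driver-specific error
            if not alerted:
                # Signal once that a human needs to look at this.
                log.error("connection lost, manual intervention needed: %s", exc)
                alerted = True
            sleep(retry_seconds)  # effectively "halted" until the link is back
        else:
            if alerted:
                log.info("connection restored, resuming")
            return result
```

To send the alert to syslog, a `logging.handlers.SysLogHandler` could be attached to the `service` logger.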
A language like this would make a lot of service programming much easier.
EDIT:
It seems (judging by all the comments and answers) that such a system doesn't exist, and probably won't in the foreseeable future because it is (nearly?) impossible to get right.
Too bad... Again, I'm not looking for this language (or framework) to get me to the moon or to monitor someone's heart rate. But for small periodic services/tasks, which always end up with loads of code handling edge cases (a power failure somewhere in the middle, connections dropping and not coming back up), an approach of pause here, fix the issues, and continue where you left off would work well.
(Or a checkpoint approach, as one of the commenters pointed out, like in a video game: set a checkpoint, and if the program dies, restart from there the next time.)
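The checkpoint idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the state fits in a small JSON document and that each work item is safe to repeat (a crash between handling an item and saving the checkpoint replays that item); the function names and the `next_index` key are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically replace the checkpoint file with the new state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be on disk before the rename
        os.replace(tmp_path, path)  # readers see the old or the new file, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process_items(items, checkpoint_path, handle):
    """Handle items in order, restarting after the last checkpointed one."""
    state = load_checkpoint(checkpoint_path, {"next_index": 0})
    for i in range(state["next_index"], len(items)):
        handle(items[i])  # assumed idempotent: it may run twice after a crash
        state["next_index"] = i + 1
        save_checkpoint(checkpoint_path, state)
```

Running `process_items` again after a crash skips everything before the last saved `next_index`.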
Bounty awarded:
At the last possible minute, when everyone was coming to the conclusion that it can't be done, Stephen C came up with Napier88, which seems to have the attributes I was looking for.
Although it is an experimental language, it does prove that it can be done, and it is something worth investigating further.
I'll be looking at creating my own framework (perhaps with persistent state and snapshots) to add the features I'm looking for in .NET or another VM.
Thanks, everyone, for the input and the great insights.
If I were going about solving your problem, I would write a daemon (probably in C) that did all database interaction in transactions so you won't get any bad data inserted if it gets interrupted. Then have the system start this daemon at startup.
Obviously, developing web stuff in C is quite a bit slower than doing it in a scripting language, but it will perform better and be more stable (if you write good code, of course :).
Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron or whatever scheduler) run it every so often, because I wouldn't need stuff updating every clock cycle.
Hope that makes sense.
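The "do all database interaction in transactions" part might look like the following sketch. It uses Python's standard `sqlite3` module purely as a stand-in for whatever database the service talks to; the `readings` table and `store_batch` are hypothetical names.

```python
import sqlite3


def store_batch(conn, rows):
    """Insert every row in one transaction: all of them are stored, or none."""
    # Used as a context manager, the connection commits if the block finishes
    # and rolls back if an exception escapes it.
    with conn:
        for name, value in rows:
            conn.execute(
                "INSERT INTO readings(name, value) VALUES (?, ?)",
                (name, value),
            )
```

An exception halfway through the batch leaves the table untouched. A power cut is covered by the database's own journal rather than by this code, which is why the work is delegated to the database in the first place.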
To my mind, failure recovery is, most of the time, a business problem, not a hardware or language problem.
Take an example: you have one UI tier and one subsystem.
The subsystem is not very reliable, but the client on the UI tier should perceive it as if it were.
Now imagine that your subsystem somehow crashes. Do you really think the language you imagine can decide for you how the UI tier, which depends on this subsystem, should handle it?
Your user should be explicitly aware that the subsystem is not reliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response that may eventually come two weeks later). And if he has to be aware of it, any abstraction that tries to hide it will eventually leak.
By client, I mean the end user. The UI should reflect this unreliability rather than hide it; a computer cannot think for you in that case.
"So a language which would remember it's state at any given moment, no matter if the power gets cut off, and continues where it left off."
"continues where it left off" is often not the correct recovery strategy. No language or environment in the world is going to attempt to guess how to recover from a particular fault automatically. The best it can do is provide you with tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.
These are very generic tools and are available in lots of languages and environments.
Erlang was designed for use in telecommunication systems, where high reliability is fundamental. I think they have a standard methodology for building sets of communicating processes in which failures can be handled gracefully.
ERLANG is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an ERLANG application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for their restart.
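The parent-restarts-child idea can be shown with a very small Python analogue. This is not Erlang/OTP and not its supervisor API, only a sketch of the principle; `supervise` and `max_restarts` are invented names.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); restart it when it fails, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            restarts += 1
            if restarts > max_restarts:
                # Give up and escalate, as a supervisor would to its own parent.
                raise RuntimeError("restart limit exceeded") from exc
```

Note that the restarted worker begins from scratch: any state it needs after a failure still has to be persisted and reloaded by the programmer, which is the point the question raises about Erlang.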
Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.
STM is a technique for implementing "transactions", i.e., sets of actions that are effectively done as one atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way that is easier to understand than traditional lock-that-resource programming, and that arguably has lower overhead by virtue of a highly optimistic, lock-free style of programming.
The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.
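The optimistic read, compute, and retry-on-conflict cycle can be illustrated on a single shared value. This is a toy sketch, not a real STM: it tracks one cell instead of read/write sets, and it uses a short lock around the commit step for simplicity. `VersionedCell` and `atomically` are made-up names.

```python
import threading


class VersionedCell:
    """A shared value with a version number used to detect conflicting commits."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()

    def read(self):
        with self._commit_lock:
            return self._value, self._version

    def try_commit(self, expected_version, new_value):
        """Commit only if nobody else committed since the value was read."""
        with self._commit_lock:
            if self._version != expected_version:
                return False  # conflict: the caller must re-execute
            self._value = new_value
            self._version += 1
            return True


def atomically(cell, update):
    """Apply update(value) as a transaction, re-executing after a conflict."""
    while True:
        value, version = cell.read()
        if cell.try_commit(version, update(value)):
            return
```

Several threads calling `atomically(cell, lambda v: v + 1)` never lose an update, because a transaction that read a stale value is rejected at commit time and runs again.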
If one insisted that all computations were transactions, and the state at the beginning(/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power fail could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see minicomputer story at end). Alternatively, committed state changes could be written to log files that were written to disk; this is the standard method used by most databases and by reliable filesystems.
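The log-file alternative mentioned at the end of the paragraph above can be sketched as an append-only commit log that is replayed on startup. This is a simplified illustration, assuming one JSON object per committed transaction and a single writer; `LoggedStore` is a hypothetical name, and real databases do considerably more (checksums, log compaction, and so on).

```python
import json
import os


class LoggedStore:
    """Key-value state rebuilt on startup from an append-only commit log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        good_bytes = 0
        with open(self.log_path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # torn final record: power failed mid-write
                try:
                    changes = json.loads(raw)
                except ValueError:
                    break  # corrupt record: treat it as never committed
                self.data.update(changes)
                good_bytes += len(raw)
        # Drop the uncommitted tail so later records start on a clean line.
        with open(self.log_path, "r+b") as log:
            log.truncate(good_bytes)

    def commit(self, changes):
        """Make changes durable first, then apply them in memory."""
        record = (json.dumps(changes) + "\n").encode("utf-8")
        with open(self.log_path, "ab") as log:
            log.write(record)
            log.flush()
            os.fsync(log.fileno())
        self.data.update(changes)
```

A half-written record behaves like a rolled-back transaction: after a restart the store holds exactly the state of the last completed `commit`.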
The current question with STM is, how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.
People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see the Communications of the ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect-free property of functional programs makes STM relatively trivial to implement in functional languages.
If I recall correctly, back in the 70s there was considerable early work on distributed operating systems in which processes (code+state) could travel trivially from machine to machine. I believe several such systems explicitly allowed node failure and could restart a process from a failed node using saved state in another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages was popular back in the 70s, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow node failure and restart, I'm fairly sure the follow-on research systems did.
EDIT: A 1996 system which appears at first glance to have the properties you desire is documented here.
Its concept of atomic transactions is consistent with the ideas behind STM.
(Goes to prove there isn't a lot new under the sun).
A side note: Back in the 70s, core memory was still king. Core, being magnetic, was nonvolatile across power failures, and many minicomputers (and I'm sure the mainframes) had power-fail interrupts that notified the software some milliseconds ahead of the loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and reliably restart. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, and come back up and restart all the teletypes as if nothing had happened. The change from cacophony to silence and back was stunning; I know, because I had to repeat it many times to debug the power-failure management code, and it of course made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was, of course, Assembler :-}
I doubt that the language features you are describing are possible to achieve.
The reason is that it would be very hard to define common, general failure modes and how to recover from them. Think for a second about your sample application: some website with some logic and database access. Let's say we have a language that can detect a power shutdown and the subsequent restart, and somehow recover from it. The problem is that the language cannot know how to recover.
Let's say your app is an online blog application. In that case it might be enough to just continue from the point where we failed, and all will be OK. However, consider a similar scenario for an online bank. Suddenly it's no longer smart to just continue from the same point. For example, if I was trying to withdraw some money from my account, and the computer died right after the checks but before it performed the withdrawal, and it then comes back one week later, it will give me the money even though my account is in the negative by now.
In other words, there is no single correct recovery strategy, so this is not something that can be built into the language. What a language can do is tell you when something bad happens, but most languages already support that with exception handling mechanisms. The rest is up to application designers to think about.
There are a lot of technologies that allow designing fault tolerant applications. Database transactions, durable message queues, clustering, hardware hot swapping and so on and on. But it all depends on concrete requirements and how much the end user is willing to pay for it all.
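For the bank withdrawal example above, one application-level strategy is to make the balance check and the debit a single atomic step, so that a restarted operation re-validates instead of trusting a check made before the crash. The sketch below uses Python's `sqlite3` module as a stand-in database; the `accounts` table and `withdraw` function are hypothetical.

```python
import sqlite3


def withdraw(conn, account_id, amount):
    """Debit the account only if the balance covers it, as one atomic step."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        # rowcount is 0 when the balance check inside the UPDATE failed.
        return cur.rowcount == 1
```

Because the check lives inside the same statement as the debit, retrying a week later cannot pay out against a balance that is no longer there. Whether retrying is the right business decision at all is still up to the application designer.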
From what I know¹, Ada is often used in safety critical (failsafe) systems.
N-Version programming may also give you some helpful background reading.
¹That's basically one acquaintance who writes embedded safety critical software
There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports Orthogonal Persistence, and in some implementations this extends (extended) to include the state of the entire computation. Specifically, when the Napier88 runtime system check-pointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.
Unfortunately, there are a number of hard issues that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.
And there is the problem of doing Orthogonal Persistence in a mainstream language. There have been attempts to do OP in Java, including one that was done by people associated with Sun (the PJama project), but there is nothing active at the moment. The JDO / Hibernate approaches are more favoured these days.
I should point out that Orthogonal Persistence isn't really disaster-proof in the large sense. For instance, it cannot deal with:
For those, I don't believe there are general solutions that would be practical.
The majority of such efforts - termed 'fault tolerance' - are around the hardware, not the software.
The extreme example of this is Tandem, whose 'nonstop' machines have complete redundancy.
Implementing fault tolerance at a hardware level is attractive because a software stack is typically made from components sourced from different providers: your high-availability software application might be installed alongside some decidedly shaky other applications and services, on top of a flaky operating system, using hardware device drivers that are decidedly fragile.
But at a language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely tested correctly and rarely tested together in multiple-failure scenarios, and it's usually in the error-handling code that the bugs hide. So it's more about programmer understanding, discipline and trade-offs than about the languages themselves.
Which brings us back to the fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error handling code in the applications.
No, a disaster-proof language does not exist.
Edit:
Disaster-proof implies perfection. It brings to mind images of a process which applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no manner by which a programming language can do this. If you, as the programmer, can not figure out how your program is going to fail and how to recover from it then your program isn't going to be able to do so either.
Disaster from an IT perspective can arise in so many fashions that no one process can resolve all of those different issues. The idea that you could design a language to address all of the ways in which something could go wrong is just incorrect. Due to the abstraction from the hardware many problems don't even make much sense to address with a programming language; yet they are still 'disasters'.
Of course, once you start limiting the scope of the problem, we can begin talking about developing a solution to it. So, when we stop talking about being disaster-proof and start speaking about recovering from unexpected power surges, it becomes much easier to develop a programming language to address that concern, even when, perhaps, it doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction that once you scope this down to realistic implementations, it becomes uninteresting as a language because it has become so specific. For example: "Use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance)"; this is not a compelling business case to my mind.
Please don't misunderstand me. There are some excellent suggestions within this thread but to my mind they do not rise to anything even remotely approaching disaster-proof.
Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume at the point it left when it restarts. Therefore, your program is 'disaster proof' to the extent that it can survive a power failure.
This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, 'fault tolerance', etc. It's curious that nobody mentioned 'memristors', as they would offer a future architecture with these properties, and perhaps one that is not a strictly von Neumann architecture either.
Now imagine a system built from two such discrete systems - for a straightforward illustration, one is a database server and the other an application server for an online banking website.
Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?
It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not check-pointed but the languages try and detect problems and ask the programmer to deal with them.
It could pause too - at the hardware level they could be tied together, such that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.
Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and by then they could be too old! Only application logic can really work out what to do in those circumstances, and there we are back to languages delegating to the programmer again.
So it seems that disaster-proofing is better in its current form: uninterruptible power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!
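The stale-message problem with persistent queues described above can be made concrete with a small sketch. The queue is modelled as a plain list of dicts, and `drain`, `created_at` and `max_age_seconds` are hypothetical names. The point is that the consumer can detect an old message, but deciding what to do with it is left to an application-supplied policy.

```python
import time


def drain(messages, handle, on_stale, max_age_seconds, now=time.time):
    """Process queued messages, routing those that are too old to on_stale."""
    for message in messages:
        age = now() - message["created_at"]
        if age > max_age_seconds:
            on_stale(message)  # application policy: reject, re-validate, ask a human
        else:
            handle(message)
```

The infrastructure can only report the age; whether a week-old withdrawal request should still be honoured is a business rule that has to be passed in as `on_stale`.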
Precise answer:
Ada and SPARK were designed for maximum fault-tolerance and to move as many bugs as possible to compile-time rather than runtime. Ada was designed for the US Dept of Defense for military and aviation systems, running on embedded devices in such things as airplanes. SPARK is its descendant. There's another language, HAL/S, used in the early US space program and geared to handling HARDWARE failure and memory corruption due to cosmic rays.
Practical answer:
I've never met anyone who can code Ada/SPARK. For most users the best answer is SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional security, is Turing-complete, and is pretty tolerant of problems.
Reason there isn't a better answer:
For performance reasons, you can't provide durability for every program operation. If you did, the processing would slow to the speed of your fastest nonvolatile storage. At best, your performance will drop by a thousand or million fold, because of how much slower ANYTHING is than CPU caches or RAM.
It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU -- at most you could do a couple hundred operations per second. Except, this would be even SLOWER.
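A back-of-the-envelope calculation shows where such numbers come from. The latencies below are rough order-of-magnitude assumptions chosen for illustration, not measurements, and real hardware varies widely.

```python
# Assumed latencies in nanoseconds; illustrative round numbers only.
ASSUMED_LATENCY_NS = {
    "ram_access": 100,
    "flash_sync_write": 100_000,     # about 0.1 ms per durable write
    "disk_sync_write": 10_000_000,   # about 10 ms per durable write
}


def slowdown_factor(durable_medium, volatile_medium="ram_access"):
    """How many times slower one durable write is than one volatile access."""
    return ASSUMED_LATENCY_NS[durable_medium] / ASSUMED_LATENCY_NS[volatile_medium]


def max_durable_ops_per_second(durable_medium):
    """Upper bound on operations per second if every one waits for durability."""
    return 1_000_000_000 / ASSUMED_LATENCY_NS[durable_medium]


print(slowdown_factor("flash_sync_write"))           # 1000.0
print(slowdown_factor("disk_sync_write"))            # 100000.0
print(max_durable_ops_per_second("disk_sync_write")) # 100.0
```

Under these assumptions, making every operation wait for a disk write caps the program at roughly a hundred operations per second, which is the order of magnitude the answer describes.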
In cases where frequent power cycling or hardware failures exist, you use something like a DBMS, which guarantees ACID for every important operation. Or, you use hardware that has fast, nonvolatile storage (flash, for example) -- this is still much slower, but if the processing is simple, this is OK.
At best your language gives you good compile-time safety checks for bugs, and will throw exceptions rather than crashing. Exception handling is a feature of half the languages in use now.
There are several commercially available frameworks (Veritas, Sun's HA, IBM's HACMP, etc.) which will automatically monitor processes and start them on another server in the event of failure.
There is also expensive hardware, like HP's Tandem NonStop range, which can survive internal hardware failures.
However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program which allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:-
Nothing could be simpler? During its long life this program has actually accumulated a bug report and is now on version 4.
That's 1 bug in three lines of code, and the current version is four times the size of the original.
Errors will always creep in, just make sure you can recover from them.
This question forced me to post this text
(It's quoted from HGTTG by Douglas Adams:)
Click, hum.
The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.
On board the ship, everything was as it had been for millennia, deeply dark and silent.
Click, hum.
At least, almost everything.
Click, click, hum.
Click, hum, click, hum, click, hum.
Click, click, click, click, click, hum.
Hmmm.
A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.
The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.
The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem.
The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.
It couldn't find the look-up table.
Odd.
It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.
The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.
Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.
This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.
Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.
This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.
This provided the first major clue as to what it was that was wrong.
Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.
The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.
The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.
It relaxed.
Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.
It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.
"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."
All of the rest was complete garbage.
Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.
It must also revive all of its crew.
There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.
Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.
The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.
Try taking an existing open source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to "handle" concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.
For a program to continue where it left off if the machine loses power, not only would it need to save state to somewhere, the OS would also have to "know" to resume it.
I suppose implementing a "hibernate" feature in a language could be done, but having that happen constantly in the background, so it's ready in the event anything bad happens, sounds like the OS's job, in my opinion.
If such a technology existed, I'd be VERY interested in reading about it. That said, the Erlang solution would be to have multiple nodes--ideally in different locations--so that if one location went down, the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck, as you mentioned in a follow-up comment.
The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.
Most of the answers given are general purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want and/or expect a robot to do when it recovered from a power failure?
In the embedded world, this can be implemented through a watchdog interrupt and a battery-backed RAM. I've written such myself.
Depending upon your definition of a disaster, delegating this responsibility to the language can range from 'difficult' to 'practically impossible'.
Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works so long as the computer doesn't get destroyed.
How would a language level feature know to restart the application on a new host?
And in the situation of restoring the application to a host - what if significant time had passed and assumptions/checks made previously were now invalid?
T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster proof' - they either succeed (and the data is saved), or they don't. Excluding disabling transactional isolation, it's difficult (but probably not impossible if you really try hard) to get into 'unknown' states.
You can use techniques like SQL Mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.
You still need to ensure you save your state every time it's safe (commit).
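As a rough illustration of "save your state every time it's safe (commit)", here is a minimal Python sketch using the standard `sqlite3` module. The table names, the `run` function and the `crash_at` parameter are made up for this example. The point is only that the work result and the progress marker are written in the same transaction, so after an interruption the database holds either both or neither.

```python
import sqlite3

def open_store(path):
    """Open (or create) a database holding both the results and the progress marker."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), next_item INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "item INTEGER PRIMARY KEY, value INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO progress (id, next_item) VALUES (1, 0)")
    conn.commit()
    return conn

def run(conn, items, crash_at=None):
    """Process items starting at the last committed position.

    crash_at (hypothetical, for demonstration) raises an error in the
    middle of the transaction for that item to imitate an interruption.
    """
    (start,) = conn.execute("SELECT next_item FROM progress WHERE id = 1").fetchone()
    for i in range(start, len(items)):
        # "with conn" commits on success and rolls back if an exception escapes,
        # so the result row and the progress update succeed or fail together.
        with conn:
            conn.execute(
                "INSERT INTO results (item, value) VALUES (?, ?)", (i, items[i] * 2)
            )
            if crash_at is not None and i == crash_at:
                raise RuntimeError("simulated failure mid-transaction")
            conn.execute("UPDATE progress SET next_item = ? WHERE id = 1", (i + 1,))
```

Calling `run` again after a failure simply continues from the stored `next_item`. Anything the job does outside the database (files, network calls) is not covered by the transaction and would still need its own recovery logic.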
If I understand your question correctly, I think that you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).
If this is correct, then I would refer you to the halting problem:
I think that classifying your question as an instance of the halting problem is fair considering that you would ideally like the language to be "disaster proof" -- that is, imparting a "perfectness" to any flawed program or chaotic environment.
This classification reduces any combination of environment, language, and program down to "program and a finite input".
If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no "disaster proof" language or compiler or environment could be proven to be so.
However, it is entirely reasonable to design a language that provides recovery options for various common problems.
In the case of power failure... it sounds to me like: "When your only tool is a hammer, every problem looks like a nail."
You don't solve power failure problems within a program. You solve this problem with backup power supplies, batteries, etc.
If the mode of failure is limited to hardware failure, VMware Fault Tolerance claims to do something similar to what you want. It runs a pair of virtual machines across multiple clusters, and using what they call vLockstep, the primary VM sends all of its state to the secondary VM in real time, so if the primary fails, execution transparently flips to the secondary.
My guess is that this wouldn't help with communication failures, which are more common than hardware failures. For serious high availability, you should consider distributed systems like Birman's process group approach (paper in pdf format, or the book Reliable Distributed Systems: Technologies, Web Services, and Applications).
The closest approximation appears to be SQL. It's not really a language issue though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.
A quick-and-dirty approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it's pretty close.
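For what it's worth, application checkpointing can be sketched in a few lines of Python. This is only an illustration under some assumptions: the state is a small picklable dict, the function names are invented for the example, and the checkpoint file lives on a local file system where `os.replace` swaps the file atomically (as it does on POSIX systems).

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temporary file, flush it to disk, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to the device
        # Readers see either the old checkpoint or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def sum_with_checkpoints(path, numbers, stop_after=None):
    """Sum numbers, checkpointing after each one.

    stop_after (hypothetical, for demonstration) raises an error after that
    many steps to imitate the process dying.
    """
    state = load_checkpoint(path, {"index": 0, "total": 0})
    steps = 0
    while state["index"] < len(numbers):
        if stop_after is not None and steps == stop_after:
            raise RuntimeError("simulated crash")
        state["total"] += numbers[state["index"]]
        state["index"] += 1
        save_checkpoint(path, state)
        steps += 1
    return state["total"]
```

Running `sum_with_checkpoints` again with the same path picks up at the saved index. The program can still die between doing a step and saving the checkpoint, which is exactly the lost "die at any moment" property: steps with external side effects have to be safe to repeat.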
I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution that is intolerant of internal faults.
If it were me, I would invest in reliable hardware AND design the software so that it can recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to reconnect manually, you are likely using the wrong API.
As others have pointed out, procedural languages embedded in modern RDBMS systems are the best you are going to get without using an exotic language.
VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing within your application as appropriate.
VMware in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point-in-time playback. Obviously there is a massive performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.
You would most likely be able to find similar solutions for Java bytecode run inside a Java virtual machine. Google "fault tolerant JVM" and "virtual machine checkpointing".
If you do want the program information saved, where would you save it?
It would need to be saved e.g. to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.
You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and save state to the database before each atomic operation. Then you can restore to the point of the last atomic operation at that level.
I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore, there is a tradeoff between level of granularity and efficiency, which would be hard to define in an arbitrary application.
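To make the granularity tradeoff concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: `store` is a plain dict standing in for durable storage, `checkpoint_every` is the chosen granularity, and `crash_at` imitates a power failure. Saving less often means fewer expensive writes but more operations to redo after a failure.

```python
def run_job(store, n_ops, checkpoint_every, crash_at=None):
    """Run n_ops operations, persisting progress every checkpoint_every operations.

    Returns the number of operations executed in this run, including any
    that had to be redone because they were not yet persisted.
    """
    done = store.get("done", 0)  # resume from the last persisted position
    executed = 0
    while done < n_ops:
        if crash_at is not None and done == crash_at:
            raise RuntimeError("simulated power failure")
        done += 1       # stands in for one atomic operation
        executed += 1
        if done % checkpoint_every == 0 or done == n_ops:
            store["done"] = done                             # the costly durable write
            store["writes"] = store.get("writes", 0) + 1     # count writes for comparison

    return executed
```

With `checkpoint_every=5`, a failure after seven operations loses the last two, which are repeated on restart, while `checkpoint_every=1` loses nothing but pays for a write on every operation. Repeating operations is only acceptable if they are safe to run twice.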
And yes, this is pretty much rocket science.
Windows Workflow Foundation may solve your problem. It's .Net based and is designed graphically as a workflow with states and actions.
It allows for persistence to the database (either automatically or when prompted). You could do this between states/actions. This serialises the entire instance of your workflow into the database. It will be rehydrated and execution will continue when any of a number of conditions is met (a certain time, programmatic rehydration, an event firing, etc.).
When a WWF host starts, it checks the persistence DB and rehydrates any workflows stored there. It then continues to execute from the point of persistence.
Even if you don't want to use the workflow aspects, you can probably still just use the persistence service.
As long as your steps are atomic, this should be sufficient, especially since I'm guessing you have a UPS, so you could monitor for UPS events and force persistence if a power issue is detected.