Erlang 的 99.9999999%(九个九)可靠性
据报道,Erlang 已在生产系统中使用了 20 多年,正常运行时间百分比为 99.9999999%。
我算了一下:
20*365.25*24*60*60*(1 - 0.999999999) == 0.631 s
这意味着系统在20年的时间里只有不到一秒的停机时间。我并不是想挑战这一点的有效性,我只是好奇我们如何能够(有意或无意)关闭系统仅 0.631 秒。有熟悉大型软件系统的人可以给我们解释一下吗?谢谢。
有谁知道如何计算处理单元(或机器)集群上的服务的停机时间?
Erlang was reported to have been used in production systems for over 20 years with an uptime percentage of 99.9999999%.
I did the math as the following:
20*365.25*24*60*60*(1 - 0.999999999) == 0.631 s
That means the system only has less than one second of downtime during the period of 20 years. I am not trying to challenge the validity of this, I am just curious about how we can shut down a system (on purpose or by accident) for only 0.631 second. Could anyone who are familiar with large software system explain this to us? Thank you.
Does anyone know how to calculate the downtime of a service over a cluster of processing units (or machines)?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
可靠性数字不应该衡量 AXD301(相关项目)任何部分关闭 20 多年的总时间。它代表了 20 年来
AXD301
系统提供的服务离线的总时间。细微的差别。正如乔·阿姆斯特朗此处所说:如果你深入挖掘一下,在 Erlang 的原作者 Joe 撰写的博士论文(其中包括 AXD301 的案例研究)中,你会读到:
因此,只要交换机所属的网络在不停机的情况下运行,作者就可以为 AXD301 声明“九个九的可靠性”(这就是他所说的全部内容,避免具体说明)。这并不一定意味着 Erlang 是如此高可靠性的唯一原因。
编辑:事实上,“20年”本身似乎是一种误解。 Joe 在同一篇文章中提到了 20 年的数字,但它实际上与九个九的可靠性数字没有联系,后者可能来自一项更短的研究(正如其他人提到的)。
The reliability figure wasn't supposed to measure the total time any part of
AXD301
(project in question) was ever shut down for over 20 years. It represents the total time over those 20 years that the service provided by theAXD301
system was ever offline. Subtle difference. As Joe Armstrong says here:If you dig a bit deeper, in the PhD thesis written by Joe, the original author of Erlang (which includes a case study of
AXD301
), you read:So, as long as the network that the switch was a part of was running without downtime, the author can state "nine nines reliability" for
AXD301
(which was all he ever said, avoiding specifics). It doesn't necessarily mean Erlang is the only cause of such high reliability.EDIT: In fact, "20 years" itself seems like a misinterpretation. Joe mentions a figure of 20 years in the same article, but it's not actually connected to the nine-nines reliability figure, which potentially came out of a much shorter study (as others have mentioned).
虽然其他人已经解决了您所询问的具体案例,但您的问题似乎是基于误解。您提出问题的方式让我相信您认为有一个手动过程可以让系统在崩溃或停机维护后再次运行。
Erlang 有几个功能可以消除作为停机来源的人工工作时间:
热代码重新加载。在 Erlang 系统中,很容易编译和加载现有模块的替换模块。 BEAM 模拟器会自动进行交换,而不会明显停止任何操作。毫无疑问,这种传输发生的时间很短,但它是在计算机时间中自动发生的,而不是在人类时间中手动发生的。这使得在基本上零停机时间的情况下进行升级成为可能。 (如果替换模块存在导致系统崩溃的错误,您可能会遇到停机情况,但这就是您在部署到生产之前进行测试的原因。)
主管。 Erlang 的 OTP 库内置了一个监控框架,可让您定义模块崩溃时系统应如何反应。这里的标准操作是重新启动发生故障的模块。假设重新启动的模块不会立即再次崩溃,则系统的总停机时间可能只有几毫秒。一个几乎不会崩溃的可靠系统在多年的运行时间中可能确实只积累了不到一秒的总停机时间。
流程。它们大致对应于其他语言中的线程,只不过它们除了通过持久数据存储之外不共享状态。除此之外,通信是通过消息传递进行的。因为 Erlang 进程非常便宜(比操作系统线程便宜得多),这鼓励了松散耦合的设计,这样如果一个进程终止,只有系统的一小部分经历停机。通常,主管会重新启动该进程,对系统的其余部分几乎没有影响。
异步消息传递。当一个进程想要告诉另一个进程某件事时,Erlang 语言中有一个一流的运算符可以让它做到这一点。消息发送过程不必等待接收者处理消息,也不必协调所发送数据的所有权。 Erlang 消息传递系统的异步功能特性可以解决所有这些问题。这有助于保持较高的正常运行时间,因为它减少了系统某一部分的停机对其他部分的影响。
聚类。这是从上一点得出的:Erlang 的消息传递机制在网络上的机器之间透明地工作,因此发送进程甚至不必关心接收者是否位于单独的机器上。这提供了一种简单的机制,可以将工作负载分配给多台机器,每台机器都可以单独关闭,而不会损害整个系统的正常运行时间。
While the others have addressed the specific case you're asking about, your question seems to be based on a misapprehension. The way you've asked the question makes me believe you are thinking there is a manual process to getting the system running again after it crashes or is taken down for maintenance.
Erlang has several features that remove human working time as a source of downtime:
Hot code reloading. In an Erlang system, it is easy to compile and load a replacement module for an existing one. The BEAM emulator does the swap automatically without apparently stopping anything. There is doubtless some tiny amount of time during which this transfer happens, but it's happening automatically in computer time, rather than manually in human time. This makes it possible to do upgrades with essentially zero downtime. (You could have downtime if the replacement module has a bug which crashes the system, but that's why you test before deploying to production.)
Supervisors. Erlang's OTP library has a supervisory framework built into it which lets you define how the system should react if a module crashes. The standard action here is to restart the failed module. Assuming the restarted module doesn't immediately crash again, the total downtime charged against your system might be a matter of milliseconds. A solid system that hardly ever crashes might indeed accumulate only a fraction of a second of total downtime over the course of years of run time.
Processes. These correspond roughly to threads in other languages, except that they do not share state except through persistent data stores. Other than that, communication happens via message passing. Because Erlang processes are very inexpensive (far cheaper than OS threads) this encourages a loosely-coupled design, so that if a process dies, only one tiny part of the system experiences downtime. Typically, the supervisor restarts that one process, with little to no impact on the rest of the system.
Asynchronous message passing. When one process wants to tell another something, there is a first-class operator in the Erlang language that lets it do that. The message sending process doesn't have to wait for the receiver to process the message, and it doesn't have to coordinate ownership of data sent. The asynchronous functional nature of Erlang's message-passing system takes care of all that. This helps maintain high uptimes because it reduces the effect that downtime in one part of a system can have on other parts.
Clustering. This follows from the previous point: Erlang's message passing mechanism works transparently between machines on a network, so a sending process doesn't even have to care that the receiver is on a separate machine. This provides an easy mechanism for dividing a workload up among many machines, each of which can go down separately without harming overall system uptime.
99.9999999% 的可用性数字是一个经常被引用但从根本上具有误导性的统计数据。 AXD-301 团队成员之一 Mats Cronqvist 进行了演讲 (视频)(我参加)在旧金山举行的 2010 年 Erlang Factory 会议上,讨论了这一精确的可用性统计数据。据他介绍,英国电信声称使用 AXD-301 进行了“5 个节点年”的试用期(我相信是从 2002 年 1 月到 9 月)。试验结束时有 14 个节点承载实时流量。
Cronqvist 特别指出,这并不代表整个 AXD-301 历史,也不代表整个 Erlang,并且他对 Joe Armstrong 不断引用这一点感到不高兴,这导致了人们对 Erlang 可靠性的过高期望。 其他已经写道五个九是一个更现实的数字。
需要说明的是,我是一名狂热的 Erlang 支持者和开发者,他相信专家使用 Erlang 确实可以带来非常高可用性的系统,但只是想减少炒作。我当然认为克朗奎斯特对事实的描述是准确的,并且没有理由相信相反的观点。
The 99.9999999% availability figure is an often-quoted but fundamentally misleading statistic. Mats Cronqvist, one of the AXD-301 team members, gave a presentation (video) (which I attended) at the 2010 Erlang Factory conference in San Francisco, discussing this precise availability statistic. According to him, it was claimed by British Telecom for a trial period (I believe from January to September 2002) of "5 node-years" using the AXD-301. There were 14 nodes carrying live traffic by the end of the trial.
Cronqvist specifically stated that this is not representative of the entire AXD-301 history, or Erlang in general, and that he was not happy that Joe Armstrong kept quoting this, leading to overblown expectations of Erlang's reliability. Others have written that five nines is a more realistic figure.
It should be stated that I am a fervent Erlang supporter and developer, who believes that the expert use of Erlang can indeed lead to very highly available systems, but just wants to reduce the hype. I of course assume that Cronqvist's representation of the facts is accurate, and have no reason to believe otherwise.
我对这些统计数据的理解是,它是在生产中的所有 AXD301 系统上计算的。我们可以预计,当 AXD301 出现严重问题时,其宕机时间将超过 0.631 秒。在此期间,其他 AXD301 将接管以保持网络运行。
然而,当您将所有运行的 AXD301 的总小时数相加,并计算出故障的 AXD301 的比率时,您会发现 99.999999%
这就是我对这个数字的理解。
希望这有帮助。
My understanding of those statistics is that it is computed over ALL AXD301 systems in production. We can expect that when an AXD301 has a severe problem, it would be down for more than 0.631 seconds. During this pediod, other AXD301 will take over to keep the network operational.
However, when you sum the total numbers of hours of all running AXD301, make the ratio for the one failing AXD301, you find 99.999999%
That's how I understand this figure.
Hope this help.