Fast-restart techniques instead of maintaining good state (availability and consistency)

Posted 2024-08-04 12:13:28 · 470 words · 3 views · 0 comments

How often do you solve your problems by restarting a computer, router, program, browser? Or even by reinstalling the operating system or software component?

This seems to be a common pattern when you suspect that a software component is not maintaining its state correctly: you simply get back to the initial state by restarting the component.

I've heard that Amazon and Google run clusters made up of a great many nodes, and that one important property of each node is that it can restart in seconds. So if one of them fails, returning it to its initial state is just a matter of restarting it.

Are there any languages, frameworks, or design patterns out there that treat this technique as a first-class citizen?

EDIT: A link that describes some of the principles behind Amazon, as well as general principles of availability and consistency:
http://www.infoq.com/presentations/availability-consistency

Comments (7)

二智少女猫性小仙女 2024-08-11 12:13:28

This is actually very rare in the Unix/Linux world. Those operating systems were designed (and so was Windows) to protect themselves from badly behaved processes. I am sure Google is not relying on hard restarts to correct misbehaving software. I would say this technique should not be employed, and if someone says that a restart is the fastest route to recovery for their software, you should look for something else!

李不 2024-08-11 12:13:28

This is common in the embedded systems world and in telecommunications. It's much less common in the server-based world.

There's a research group you might be interested in. They've been working on Recovery-Oriented Computing or "ROC". The key principle in ROC is that the cleanest, best, most reliable state that any program can be in is right after starting up. Therefore, on detecting a fault, they prefer to restart the software rather than attempt to recover from the fault.

Sounds simple enough, right? Well, most of the research has gone into implementing that idea. The reason is exactly what you and other commenters have pointed out: OS restarts are too slow to be a viable recovery method.

ROC relies on three major parts:

  1. A method to detect faults as early as possible.
  2. A means of isolating the faulty component while preserving the rest of the system.
  3. Component-level restarts.

The key difference between ROC and the typical "nightly restart" approach is that ROC is a strategy in which reboots are a reaction to a detected fault. What I mean is that most software is written with some degree of error handling and recovery (throw-and-catch, logging, retry loops, etc.). A ROC program would instead detect the fault (exception) and immediately exit. Mixing the two paradigms just leaves you with the worst of both worlds: low reliability and errors.
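As a rough illustration of "detect the fault and immediately exit", here is a minimal Python sketch of a supervisor that restarts a component from a clean state whenever it exits with an error. It is not taken from the ROC project; `WORKER_SOURCE`, `supervise`, and the simulated fault are all hypothetical names and assumptions made up for this example.

```python
import subprocess
import sys

# Hypothetical worker: it does no local error recovery. On a detected fault
# it exits at once with a non-zero status and leaves recovery to the supervisor.
WORKER_SOURCE = """
import sys
attempt = int(sys.argv[1])
if attempt < 2:        # assumption: simulate a transient fault on the first two starts
    sys.exit(1)        # fail fast: no retry loop, no partial cleanup
print("work done")
"""


def supervise(max_restarts=5):
    """Run the worker, restarting it in a fresh process after each failure.

    Returns (number of restarts needed, worker output).
    """
    for attempt in range(max_restarts + 1):
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(attempt)],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return attempt, result.stdout.strip()
    raise RuntimeError("worker kept failing; giving up")


if __name__ == "__main__":
    restarts, output = supervise()
    print(f"restarts={restarts} output={output!r}")
```

The worker contains no recovery logic at all; the only recovery path is the restart, which is what keeps the two paradigms from being mixed.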

挽心 2024-08-11 12:13:28

Microcontrollers typically have a watchdog timer, which must be reset (by a line of code) every so often or else the microcontroller will reset. This keeps the firmware from getting stuck in an endless loop, stuck waiting for input, etc.

Unused memory is sometimes filled with an instruction that causes a reset, or with a jump to the same location that the microcontroller starts at when it is reset. This resets the microcontroller if it somehow jumps to a location outside the program memory.
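The watchdog behaviour can be simulated in a few lines of Python. This is only a sketch of the idea, not real firmware: on an actual microcontroller the counter is a hardware peripheral, and the `Watchdog` class, `run_firmware` function, and tick-based timing here are assumptions invented for illustration.

```python
class Watchdog:
    """Simulated watchdog: counts down on every tick and expires at zero unless kicked."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self):
        """Reload the counter; this stands in for the one line of code in the main loop."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one timer tick; return True when the watchdog has expired."""
        self.remaining -= 1
        return self.remaining <= 0


def run_firmware(total_ticks, hang_at_tick, timeout_ticks=3):
    """Simulate firmware that hangs once; return the ticks at which a reset happened."""
    watchdog = Watchdog(timeout_ticks)
    resets = []
    hung = False
    for tick in range(total_ticks):
        if tick == hang_at_tick:
            hung = True                          # firmware is stuck and stops kicking
        if not hung:
            watchdog.kick()
        if watchdog.tick():
            resets.append(tick)
            watchdog = Watchdog(timeout_ticks)   # reset: back to the initial state
            hung = False
    return resets


if __name__ == "__main__":
    print(run_firmware(total_ticks=10, hang_at_tick=4))
```

With a timeout of three ticks, the last kick happens on tick 3, so a hang starting at tick 4 triggers the reset on tick 5, after which the simulated firmware runs normally again.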

混浊又暗下来 2024-08-11 12:13:28

Embedded systems may have a checkpoint feature where the current stack is saved every n ms.
The memory is non-volatile across a power cycle (i.e. battery-backed), so on power-up a test is made to see whether the code needs to jump to an old checkpoint or whether it is a fresh system.

I'm going to guess that a similar (but more sophisticated) technique is used at Amazon and Google.
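A minimal Python sketch of the same idea at the application level, assuming a JSON file stands in for the battery-backed memory and a small dictionary stands in for the saved stack. The function names and the state layout are hypothetical.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state atomically so a crash mid-write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)   # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def process(path, items, crash_at=None):
    """Sum the items, checkpointing after each one; resume from the checkpoint on restart."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if i == crash_at:
            raise RuntimeError("simulated power loss")
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `process` again after a simulated crash picks up at the first unprocessed item instead of starting over, which is the test "old checkpoint or fresh system" described above.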

熟人话多 2024-08-11 12:13:28

Though I can't think of a design pattern per se, in my experience, it's a result of "select is broken" from developers.

I've seen a 50-user site cripple both SQL Server Enterprise Edition (with a 750 MB database) and a Novell server because of poor connection management coupled with excessive calls and no caching. Novell was always the culprit according to developers until we found a missing "CloseConnection" call in a core library. By then, thousands were spent, unsuccessfully, on upgrades to address that one missing line of code.

(Why they had Enterprise Edition was beyond me so don't ask!!)

中二柚 2024-08-11 12:13:28

If you look at scripting languages like PHP running on Apache (in the classic CGI setup), each invocation starts a new process. In the basic case there is no shared state between processes, and once the invocation has finished the process is terminated.

The advantages are less onus on resource management, since resources are released when the process finishes, and less need for error handling, since the process is designed to fail fast and cannot be left in an inconsistent state.
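The process-per-request model can be sketched in Python by running each request in a fresh interpreter. This only illustrates the idea and is not how Apache or PHP are implemented; `HANDLER_SOURCE` and `handle_request` are hypothetical names.

```python
import subprocess
import sys

# Hypothetical request handler, run in a brand-new process for every request.
HANDLER_SOURCE = """
import sys
counter = 0                  # process-local state, gone when the process exits
counter += 1
name = sys.argv[1]
if name == "crash":
    sys.exit(1)              # fail fast; nothing to clean up
print(f"hello {name}, request #{counter} in this process")
"""


def handle_request(name):
    """Serve one request in its own short-lived process."""
    result = subprocess.run(
        [sys.executable, "-c", HANDLER_SOURCE, name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return "500 internal error"
    return result.stdout.strip()


if __name__ == "__main__":
    for name in ["alice", "crash", "bob"]:
        print(handle_request(name))
```

Every request sees `counter` start from zero, and the crashed request has no effect on the one that follows it, because no state survives the end of a process.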

半边脸i 2024-08-11 12:13:28

I've seen it in a few places at the application level (an app restarting itself if it bombs).

I've implemented the pattern at the application level, where a service reading from dBase files starts getting errors after a certain number of reads. It looks for a particular error that gets thrown, and if it sees that error, the service calls a console app that kills the process and restarts the service. It's kludgy, and I hate it, but for this particular situation I could find no better answer.

Also bear in mind that IIS has a built-in feature that restarts the application pool under certain conditions.

For that matter, restarting a service is an option for any service on Windows as one of the actions to take when the service fails.
