What can we learn about building distributed systems from the recent Amazon EC2 outage?

Posted on 2024-11-04 07:10:30

Now that the dust has settled, what can we learn about building distributed systems from the recent Amazon EC2 and Amazon RDS service outage?

Comments (2)

擦肩而过的背影 2024-11-11 07:10:30

Thanks for the interesting links. Obviously every distributed system is different and every outage is unique, so it is difficult to generalise. Some takeaways I have are:

  1. Outages happen to even the best guys on the block...so you better plan for yours.

  2. Building distributed systems is hard...so you need experience and experienced friends.

  3. Manual changes are a common cause...not said explicitly in the AWS writeup, but strongly implied.

  4. Outages are often "emergent" phenomena, whereby a simple error causes many systems to interact in a way that grows exponentially. The AWS writeup refers to this as a "storm", and I have witnessed similar "storms" in large distributed systems. The degree of coupling and simple aspects like backoff parameters can make the difference between a disturbance that grows exponentially and one that decays exponentially. Think of the Tacoma Narrows bridge: perhaps the analogy is a stretch, but tuning a few simple parameters can avoid destructive resonances.

  5. The Netflix Chaos Monkey is interesting. The "Lean" guys have taught us that if something is difficult (like testing or deployment) then you should do it often until it isn't difficult any more. Perhaps system failure/resilience is the next frontier for this approach.
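
To make the backoff point in item 4 concrete, here is a minimal Python sketch of capped exponential backoff with random jitter. It is not taken from the AWS writeup: the function names (`backoff_delay`, `retry_with_backoff`) and the parameters (`base`, `cap`, `max_attempts`, `retry_on`) are hypothetical and chosen only for illustration. The idea is that each failed attempt waits a random time below a ceiling that doubles per attempt, so many clients that fail together tend to spread their retries out rather than hitting the service again in lockstep.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    The ceiling doubles with each attempt and is capped; the random draw
    (jitter) keeps many clients from retrying at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=30.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Only exceptions listed in retry_on are retried; the last failure is
    re-raised so the caller can see it.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base=base, cap=cap))
```

The values of `base` and `cap` here are placeholders, not recommendations. They are exactly the kind of "simple parameters" the answer refers to, and suitable values depend on the system.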

壹場煙雨 2024-11-11 07:10:30

Now Netflix's Chaos Monkey makes more sense. Check the Netflix tech blog.
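
For readers unfamiliar with the idea, the following is a toy, in-process Python sketch of the kind of experiment Chaos Monkey stands for: deliberately kill a randomly chosen instance, then check that the service still answers. It is not Netflix's implementation and does not touch any real infrastructure. `Instance`, `Service`, and `kill_random_instance` are hypothetical names invented for this example, and the failover logic is a simplifying assumption.

```python
import random


class Instance:
    """A stand-in for one server in a pool."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes a request to the first instance that responds."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instances left")


def kill_random_instance(service, rng=random):
    """Mark one randomly chosen live instance as dead and return it.

    Returns None when every instance is already dead.
    """
    alive = [i for i in service.instances if i.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


if __name__ == "__main__":
    service = Service(Instance(f"node-{n}") for n in range(3))
    victim = kill_random_instance(service)
    print(f"killed {victim.name}")
    print(service.handle("GET /"))  # should still succeed via failover
```

Running such an experiment regularly is one way to apply the "do the difficult thing often" advice from the first answer to failure handling.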
