Scala Akka: How to develop a multi-machine high-availability cluster
We're developing a server system in Scala + Akka for a game that will serve clients in Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies (of, say, hardware failure), the system needs to keep running. I think I want the clients to have a list of machines they will try to connect with, similar to how Cassandra works.
The multi-node examples I've seen so far with Akka seem to me to be centered around the idea of scalability, rather than high availability (at least with regard to hardware). The multi-node examples seem to always have a single point of failure. For example there are load balancers, but if I need to reboot one of the machines that have load balancers, my system will suffer some downtime.
Are there any examples that show this type of hardware fault tolerance for Akka? Or, do you have any thoughts on good ways to make this happen?
So far, the best answer I've been able to come up with is to study the Erlang OTP docs, meditate on them, and try to figure out how to put my system together using the building blocks available in Akka.
But if there are resources, examples, or ideas on how to share state between multiple machines in a way that if one of them goes down things keep running, I'd sure appreciate them, because I'm concerned I might be re-inventing the wheel here. Maybe there is a multi-node STM container that automatically keeps the shared state in sync across multiple nodes? Or maybe this is so easy to make that the documentation doesn't bother showing examples of how to do it, or perhaps I haven't been thorough enough in my research and experimentation yet. Any thoughts or ideas will be appreciated.
Comments (4)
HA and load management is a very important aspect of scalability and is available as a part of the AkkaSource commercial offering.
If you're listing multiple potential hosts in your clients already, then those can effectively become load balancers.
You could offer a host suggestion service that recommends to the client which machine it should connect to (based on current load, or whatever); the client can then pin to that machine until the connection fails.
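A minimal sketch of the pinning behaviour, in plain Scala rather than Akka, assuming the suggestion service is reachable through a function that returns a host name or `None`. `PinnedHostClient`, `suggest`, and `connectionFailed` are hypothetical names, not part of Akka or any library. The client asks for a suggestion only when it has no pinned host, and drops the pin when the connection fails.

```scala
// Hypothetical client-side helper; not an Akka or library API.
// `suggest` stands in for a call to the host suggestion service and
// returns None when the service cannot be reached.
final class PinnedHostClient(suggest: () => Option[String]) {
  private var pinned: Option[String] = None

  /** The host to talk to: the pinned one, or a fresh suggestion if unpinned. */
  def host(): Option[String] = {
    if (pinned.isEmpty) pinned = suggest()
    pinned
  }

  /** Called by the networking layer when the connection to the pinned host drops. */
  def connectionFailed(): Unit = {
    pinned = None
  }
}
```

When `host()` returns `None` (the suggestion service is down), the client has to fall back on its own list of hosts.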
If the host suggestion service is not there, the client can simply pick a random host from its internal list, trying hosts until it connects.
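The random fallback could look like this, again as a hypothetical plain-Scala sketch: `RandomFailover.connectToAny` and the `tryConnect` callback (standing in for real socket code that returns `true` on success) are illustrative names. Shuffling once and then trying hosts in order means each host is tried at most once, and `find` stops at the first host that accepts.

```scala
import scala.util.Random

// Hypothetical fallback used when the host suggestion service is unreachable:
// shuffle the client's known hosts and try each one until a connection succeeds.
object RandomFailover {
  def connectToAny(
      knownHosts: Seq[String],
      tryConnect: String => Boolean, // stand-in for real connection code
      rng: Random = new Random()
  ): Option[String] =
    // `find` short-circuits, so hosts after the first success are never tried.
    rng.shuffle(knownHosts).find(tryConnect)
}
```

Shuffling per client spreads first connections across hosts, so the fallback list does not funnel every client onto `host1`.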
Ideally, on first startup the client will connect to the host suggestion service and not only get directed to an appropriate host, but also receive a list of other potential hosts. This list can be updated every time the client connects.
If the host suggestion service is down on the client's first attempt (unlikely, but...), then you can pre-deploy a list of hosts in the client install so it can start selecting hosts at random right from the beginning if it has to.
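A sketch of the host list itself, assuming it is seeded from a list shipped with the client install and replaced wholesale whenever the suggestion service returns a new one. `HostList` is a hypothetical name, and ignoring an empty reply is a design choice of this sketch rather than something the answer prescribes.

```scala
// Hypothetical host-list store: starts from the list baked into the client
// install and is replaced whenever the suggestion service sends a fresh one.
final class HostList(preDeployed: Seq[String]) {
  require(preDeployed.nonEmpty, "ship at least one host with the client")
  private var hosts: Vector[String] = preDeployed.toVector

  def current: Vector[String] = hosts

  /** Apply the list returned by the suggestion service. Empty replies are
    * ignored so a misbehaving service cannot leave the client with nothing to try. */
  def update(fromService: Seq[String]): Unit = {
    if (fromService.nonEmpty) hosts = fromService.distinct.toVector
  }
}
```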
Make sure that your list of hosts uses actual host names, not IPs; that gives you more flexibility in the long term (i.e. you'll "always have" host1.example.com, host2.example.com, etc., even if you move infrastructure and change IPs).
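One way to keep that flexibility on the JVM is to store only host names and resolve them at connection time. This is a sketch; `HostResolution` and port 7000 are hypothetical. `InetAddress.getAllByName` is standard Java; note that the JVM caches successful lookups (controlled by the `networkaddress.cache.ttl` security property), so a changed IP is seen only after that cache entry expires.

```scala
import java.net.{InetAddress, InetSocketAddress}

// Resolve the configured host name each time a connection is attempted,
// so a DNS change (e.g. host1.example.com moving to a new IP) is picked up
// without shipping a new client build.
object HostResolution {
  def addressesFor(host: String, port: Int): Seq[InetSocketAddress] =
    InetAddress.getAllByName(host).toSeq.map(addr => new InetSocketAddress(addr, port))
}
```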
You could take a look at how RedDwarf and its fork DimDwarf are built. They are both horizontally scalable, crash-only game application servers, and DimDwarf is partly written in Scala (the new messaging functionality). Their approach and architecture should match your needs quite well :)
2 cents..
"how to share state between multiple machines in a way that if one of them goes down things keep running"
Don't share state between machines; instead, partition state across machines. I don't know your domain, so I don't know if this will work. But essentially, if you assign certain aggregates (in DDD terms) to certain nodes, you can keep those aggregates in memory (actor, agent, etc.) while they are being used. To do this, you will need something like ZooKeeper to coordinate which nodes handle which aggregates. In the event of a failure, you can bring the aggregate up on a different node.
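The answer leaves the assignment mechanism to ZooKeeper or something like it. As an illustration only, the sketch below swaps in rendezvous (highest-random-weight) hashing for the assignment step: given the list of live nodes (which in the answer's design would come from the coordinator; here it is just a parameter), every node computes the same owner for an aggregate, and when a node dies only the aggregates it owned move elsewhere. `AggregatePlacement` and the node names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Sketch: decide which node owns an aggregate using rendezvous hashing.
// The live-node list is assumed to come from a membership service
// (e.g. ZooKeeper); here it is simply passed in.
object AggregatePlacement {
  private def weight(node: String, aggregateId: String): Int =
    MurmurHash3.stringHash(node + "|" + aggregateId)

  /** The owner is the live node with the highest weight for this aggregate.
    * Ties are broken by node name so every caller picks the same node. */
  def ownerOf(aggregateId: String, liveNodes: Seq[String]): Option[String] =
    if (liveNodes.isEmpty) None
    else Some(liveNodes.maxBy(n => (weight(n, aggregateId), n)))
}
```

Because each aggregate's owner is the maximum over the live nodes, removing a node changes the owner only for aggregates whose maximum was that node.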
Furthermore, if you use an event-sourcing model to build your aggregates, it becomes almost trivial to keep real-time copies (slaves) of an aggregate on other nodes: those nodes listen for the aggregate's events and maintain their own copies.
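A minimal sketch of that idea, with hypothetical game-domain types (`PlayerEvent`, `Player`) and a plain callback standing in for whatever messaging would carry events between nodes. Because state is a pure fold over events, a replica that receives the same events in the same order ends up with the same state as the primary.

```scala
// Hypothetical domain events for a player aggregate.
sealed trait PlayerEvent
final case class Joined(name: String) extends PlayerEvent
final case class ScoreAdded(points: Int) extends PlayerEvent

final case class Player(name: String = "", score: Int = 0) {
  // Pure state transition: the same events always yield the same state.
  def applyEvent(e: PlayerEvent): Player = e match {
    case Joined(n)     => copy(name = n)
    case ScoreAdded(p) => copy(score = score + p)
  }
}

object PlayerLog {
  def replay(events: Seq[PlayerEvent]): Player =
    events.foldLeft(Player())((state, e) => state.applyEvent(e))
}

// A replica on another node folds each event into its own copy.
final class Replica {
  private var state = Player()
  def onEvent(e: PlayerEvent): Unit = { state = state.applyEvent(e) }
  def current: Player = state
}

// The primary appends to its event log and pushes each event to subscribers
// (a direct call here; remote messaging in a real deployment).
final class Primary(subscribers: Seq[Replica]) {
  private var log = Vector.empty[PlayerEvent]
  def persist(e: PlayerEvent): Unit = {
    log :+= e
    subscribers.foreach(_.onEvent(e))
  }
  def current: Player = PlayerLog.replay(log)
}
```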
By using Akka, we get remoting between nodes almost for free. This means that whichever node handles a request that might need to interact with an aggregate/entity on another node can do so with RemoteActors.
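A sketch of the routing decision, assuming a hypothetical `Transport` trait as a stand-in for Akka remoting (in the answer's design, a remote actor reference); nothing here is an Akka API. A node handles the request itself when it owns the aggregate and forwards it otherwise, so callers do not need to know where the aggregate lives. `InMemoryTransport` exists only to make the sketch runnable in one process.

```scala
import scala.collection.mutable

// Hypothetical stand-in for remote messaging between nodes.
trait Transport {
  def send(node: String, aggregateId: String, command: String): String
}

final class Node(
    nodeName: String,
    ownerOf: String => String, // e.g. backed by the placement/ZooKeeper layer
    transport: Transport
) {
  // In-memory aggregate state owned by this node (command history, for brevity).
  private val local = mutable.Map.empty[String, List[String]]

  /** Entry point for any request: handle locally or forward to the owner. */
  def handle(aggregateId: String, command: String): String = {
    val owner = ownerOf(aggregateId)
    if (owner == nodeName) handleLocally(aggregateId, command)
    else transport.send(owner, aggregateId, command)
  }

  def handleLocally(aggregateId: String, command: String): String = {
    local(aggregateId) = command :: local.getOrElse(aggregateId, Nil)
    s"$nodeName handled $command for $aggregateId"
  }

  def history(aggregateId: String): List[String] = local.getOrElse(aggregateId, Nil)
}

// Single-process transport used only to exercise the sketch.
final class InMemoryTransport extends Transport {
  val nodes = mutable.Map.empty[String, Node]
  def send(node: String, aggregateId: String, command: String): String =
    nodes(node).handleLocally(aggregateId, command)
}
```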
What I have outlined here is very general but gives an approach to distributed fault-tolerance with Akka and ZooKeeper. It may or may not help. I hope it does.
All the best,
Andy