Heartbeat protocols/algorithms or best practices
Recently I added some load-balancing capabilities to a piece of software that I wrote. It is a networked application that does some data crunching based on input coming from a SQL database. Since the crunching can be pretty intensive, I've added the capability to run multiple instances of this application on different servers to split the load, but as it stands the load balancing is a manual act: a user must specify which instances take which portion of the input domain.
I would like to take that to the next level and program the instances to automatically negotiate the dividing up of the input data and to recognize if one of them "disappears" (has crashed or has been powered down), so that the remaining instances can take on the failed instance's workload.
To implement this, I'm considering using a simple heartbeat protocol between the instances to determine who's online and who isn't. While this is not terribly complicated, I'd like to know if there are any established heartbeat network protocols (based on UDP, TCP or both).
Obviously this happens a lot in the networking world with clustering, fail-over and high-availability technologies, so in the end I'd like to know whether there are any established protocols or algorithms that I should be aware of or implement.
EDIT
It seems, based on the answers, that either there are no well-established heartbeat protocols or nobody knows about them (which would imply that they aren't so well established after all), in which case I'm just going to roll my own.
While none of the answers offered what I was looking for specifically, I'm going to vote for Matt Davis's answer, since it was the closest and he pointed out the good idea of using multicast.
Thank you all for your time~
Comments (5)
Distributed Interactive Simulation (DIS), which is defined under IEEE Standard 1278, uses a default heartbeat of 5 seconds via UDP broadcast. A DIS heartbeat is essentially an Entity State PDU, which fully defines the state, including the position, of the given entity. Due to its application within the simulation community, DIS also uses a concept referred to as dead-reckoning to provide higher-frequency heartbeats when the actual position, for example, is outside a given threshold of its predicted position.
In your case, a DIS Entity State PDU would be overkill. I only mention it to make note of the fact that heartbeats can vary in frequency depending on the circumstances. I don't know that you'd need something like this for the application you described, but you never know.
For heartbeats, use UDP, not TCP. A heartbeat is, by nature, a connectionless contrivance, so it follows that UDP (connectionless) is more relevant here than TCP (connection-oriented).
The thing to keep in mind about UDP broadcasts is that a broadcast message is confined to the broadcast domain. In short, if you have computers that are separated by a layer 3 device, e.g., a router, then broadcasts are not going to work because the router will not forward broadcast messages from one broadcast domain to another. In this case, I would recommend using multicast, since it will span the broadcast domains, provided the time-to-live (TTL) value is set high enough. It's also a more automated approach than directed unicast, which would require the sender to know the IP address of the receiver in order to send the message.
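As a rough illustration of the multicast suggestion, here is a minimal Python sketch of the socket setup. The group address `239.255.0.1`, the port `50000`, the default TTL of 4 and the helper names are all assumptions chosen for the example, not values from the answer. Note that a high enough TTL is necessary but may not be sufficient: the routers in between also have to be configured to forward multicast.

```python
import socket
import struct

GROUP = "239.255.0.1"  # assumed: an administratively scoped multicast group
PORT = 50000           # assumed: any free UDP port


def make_sender(ttl=4):
    """Create a UDP socket whose multicast packets survive `ttl` router hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # The default multicast TTL is 1, which keeps packets on the local subnet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def make_receiver(group=GROUP, port=PORT):
    """Create a UDP socket that has joined the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Membership request: group address plus local interface (0.0.0.0 = default).
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def send_heartbeat(sock, payload, group=GROUP, port=PORT):
    """Send one heartbeat datagram to every member of the group."""
    sock.sendto(payload, (group, port))
```

Each instance would call `make_receiver()` once and read datagrams with `recvfrom`, while a timer calls `send_heartbeat(make_sender(), b"...")` at the chosen interval. No instance needs to know the others' IP addresses in advance.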
Broadcast a heartbeat every interval t using UDP; if you haven't heard from a machine in more than k*t, assume it is down. Be careful that the aggregate bandwidth used isn't a drain on resources. You can use IP broadcast addresses, or keep a list of the specific IPs you're doing work for.
Make sure the heartbeat includes a "reboot count" as well as a "machine ID", so that you can tell when a machine has restarted and its previous state is no longer around.
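The timeout rule and the two heartbeat fields can be sketched as follows. This is only one possible reading of the advice: the JSON encoding, the `PeerTable` class, the event names and the default `k=3` are assumptions made for the example. Time is passed in explicitly so the logic can be exercised without a network or a real clock.

```python
import json
from dataclasses import dataclass


def encode_heartbeat(machine_id, reboot_count):
    """Serialize a heartbeat; JSON is an arbitrary choice for this sketch."""
    return json.dumps({"machine_id": machine_id, "reboot_count": reboot_count}).encode()


@dataclass
class Peer:
    reboot_count: int
    last_seen: float


class PeerTable:
    def __init__(self, interval, k=3):
        # A peer is presumed down after more than k missed intervals.
        self.timeout = k * interval
        self.peers = {}

    def observe(self, payload, now):
        """Record a received heartbeat and report what it tells us."""
        msg = json.loads(payload)
        machine_id, reboots = msg["machine_id"], msg["reboot_count"]
        peer = self.peers.get(machine_id)
        if peer is None:
            self.peers[machine_id] = Peer(reboots, now)
            return "joined"
        if reboots < peer.reboot_count:
            return "stale"  # delayed packet from before the last restart
        # A higher reboot count means the old in-memory state is gone.
        event = "restarted" if reboots > peer.reboot_count else "alive"
        peer.reboot_count = reboots
        peer.last_seen = now
        return event

    def down(self, now):
        """Return the IDs of peers silent for more than k * interval."""
        return sorted(m for m, p in self.peers.items() if now - p.last_seen > self.timeout)
```

In a running instance, `now` would typically come from `time.monotonic()`, and the peers reported by `down()` would be the ones whose share of the input gets redistributed.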
I'd recommend using MapReduce if it fits. It would save a lot of work.
I'm not sure this will answer the question, but you might be interested in the way WebLogic Server clustering works under the hood. From the book Mastering BEA WebLogic Server:
Cisco content switches are a hardware solution for this problem. They implement a virtual IP address as a front end to multiple real servers, whose real IP addresses are known to the switch. The switch periodically sends HTTP HEAD requests to the web servers, to verify they are still running (which the switch software calls a "keepalive", although this doesn't keep the server itself alive). The Cisco switch accepts traffic on the virtual IP and forwards it to the actual web servers, using configurable load balancing such as round-robin, or user-defined load balancing.
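The keepalive check described above can be approximated in software. The sketch below is not how the Cisco switch is implemented; it only illustrates the idea of probing each server with an HTTP HEAD request and then load-balancing over the ones that respond. The function names, the 2-second timeout and the rule "any status below 500 counts as alive" are assumptions for the example.

```python
import http.client


def is_alive(host, port, path="/", timeout=2.0):
    """Probe a server with an HTTP HEAD request; True if it answers sanely."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: treat as down.
        return False
    finally:
        conn.close()


def healthy_backends(servers, check=is_alive):
    """Filter a list of (host, port) pairs down to those passing the probe."""
    return [(host, port) for host, port in servers if check(host, port)]
```

A round-robin balancer would re-run `healthy_backends` periodically and rotate through the result, for example with `itertools.cycle`.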
These switches retail in the $3-10K range, although my business partner picked one up on eBay for about $300 a year ago. If you can afford one, they do represent a proven hardware solution to the question of how to have a service spread transparently across multiple servers. Red Hat includes a built-in port configuration, so you could implement your own equivalent of a Cisco switch using a cheap Red Hat box. Google for "virtual ip address" and "cisco content router" for more information.
In addition to trying hardware load balancers, you can also try a free, open-source load-balancing application such as HAProxy, which is available for Linux and the BSDs.