Heartbeat protocols/algorithms or best practices

Posted on 2024-08-05 12:59:46

Recently I added some load-balancing capabilities to a piece of software that I wrote. It is a networked application that does some data crunching based on input coming from a SQL database. Since the crunching can be pretty intensive, I've added the ability to run multiple instances of the application on different servers to split the load, but as it stands the load balancing is a manual act: a user must specify which instances take which portion of the input domain.

I would like to take that to the next level and program the instances to automatically negotiate the dividing up of the input data, and to recognize if one of them "disappears" (has crashed or has been powered down) so that the remaining instances can take on the failed instance's workload.

To implement this I'm considering a simple heartbeat protocol between the instances to determine who's online and who isn't. While this is not terribly complicated, I'd like to know if there are any established heartbeat network protocols (based on UDP, TCP or both).

Obviously this happens a lot in the networking world with clustering, fail-over and high-availability technologies, so in the end I'd like to know whether there are any established protocols or algorithms that I should be aware of or implement.

EDIT

It seems, based on the answers, that either there are no well-established heartbeat protocols or nobody knows about them (which would imply that they aren't so well established after all), in which case I'm just going to roll my own.

While none of the answers offered what I was looking for specifically, I'm going to vote for Matt Davis's answer since it was the closest and he pointed out a good idea: using multicast.

Thank you all for your time~

Comments (5)

卷耳 2024-08-12 12:59:46

Distributed Interactive Simulation (DIS), which is defined under IEEE Standard 1278, uses a default heartbeat of 5 seconds via UDP broadcast. A DIS heartbeat is essentially an Entity State PDU, which fully defines the state, including the position, of the given entity. Due to its application within the simulation community, DIS also uses a concept referred to as dead reckoning to provide higher-frequency heartbeats when the actual position, for example, is outside a given threshold of its predicted position.

In your case, a DIS Entity State PDU would be overkill. I only mention it to make note of the fact that heartbeats can vary in frequency depending on the circumstances. I don't know that you'd need something like this for the application you described, but you never know.
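To illustrate the "variable frequency" point, here is a rough sketch of a sender that emits a heartbeat at least every 5 seconds, and sooner when the real value drifts too far from what receivers would extrapolate. This is a simplified illustration of the idea, not the IEEE 1278 algorithm: the class name `AdaptiveHeartbeat`, the one-dimensional `value`/`rate` state and the `threshold` default are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveHeartbeat:
    """Decides when to send a heartbeat (hypothetical, simplified dead reckoning)."""
    max_interval: float = 5.0  # seconds; mirrors the 5 s default mentioned above
    threshold: float = 1.0     # assumed allowed gap between actual and predicted value
    last_sent_at: float = 0.0
    last_value: float = 0.0
    last_rate: float = 0.0     # rate of change reported in the last heartbeat

    def predicted(self, now: float) -> float:
        # What a receiver would extrapolate from the last heartbeat it got.
        return self.last_value + self.last_rate * (now - self.last_sent_at)

    def should_send(self, now: float, actual: float) -> bool:
        overdue = now - self.last_sent_at >= self.max_interval
        drifted = abs(actual - self.predicted(now)) > self.threshold
        return overdue or drifted

    def mark_sent(self, now: float, value: float, rate: float) -> None:
        # Remember what was just reported so later predictions start from it.
        self.last_sent_at, self.last_value, self.last_rate = now, value, rate
```

In a load-balancing setting the tracked value might be something like queue depth rather than a position, but that is a guess about the application, not something the question specifies.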

For heartbeats, use UDP, not TCP. A heartbeat is, by nature, a connectionless contrivance, so it follows that UDP (connectionless) is more relevant here than TCP (connection-oriented).

The thing to keep in mind about UDP broadcasts is that a broadcast message is confined to the broadcast domain. In short, if you have computers that are separated by a layer 3 device, e.g., a router, then broadcasts are not going to work because the router will not forward broadcast messages from one broadcast domain to another. In this case, I would recommend using multicast since it will span the broadcast domains, provided the time-to-live (TTL) value is set high enough. It's also a more automated approach than directed unicast, which would require the sender to know the IP address of the receiver in order to send the message.
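A minimal sketch of the multicast setup using Python's standard `socket` module. The group address `239.1.2.3`, the port `50000` and the helper names are placeholders chosen for this example, and the routers in between still have to be configured to forward multicast, which the code cannot do for you.

```python
import socket
import struct

GROUP = "239.1.2.3"  # hypothetical administratively scoped multicast group
PORT = 50000         # hypothetical port


def make_sender(ttl: int = 4) -> socket.socket:
    """UDP socket whose multicast datagrams may cross up to `ttl` router hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A one-byte TTL value is accepted across common platforms.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def membership_request(group: str) -> bytes:
    """Packed ip_mreq: group address followed by the local interface (any)."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


def make_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """UDP socket bound to `port` and joined to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    return sock


def send_heartbeat(sock: socket.socket, payload: bytes,
                   group: str = GROUP, port: int = PORT) -> None:
    # The sender only needs the group address, not each receiver's IP.
    sock.sendto(payload, (group, port))
```

Each instance would call `make_receiver()` once and read datagrams with `recvfrom()`, while periodically calling `send_heartbeat(make_sender(), b"...")`; no instance needs a list of its peers' addresses.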

青丝拂面 2024-08-12 12:59:46

Broadcast a heartbeat every t using UDP; if you haven't heard from a machine in more than k*t, then it's assumed down. Be careful that the aggregate bandwidth used isn't a drain on resources. You can use IP broadcast addresses, or keep a list of specific IPs you're doing work for.

Make sure the heartbeat includes a "reboot count" as well as "machine ID" so that you know previous server state isn't around.
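Something along these lines would cover both points: the k*t timeout and the reboot count. It's only a sketch; the JSON wire format, the field names `id` and `reboots`, and the `PeerTable` class are made up for illustration, and timestamps are passed in explicitly so the logic stays independent of the socket code.

```python
import json
from dataclasses import dataclass, field


def encode_heartbeat(machine_id: str, reboot_count: int) -> bytes:
    """Serialize a heartbeat (hypothetical JSON format)."""
    return json.dumps({"id": machine_id, "reboots": reboot_count}).encode("utf-8")


@dataclass
class PeerTable:
    interval: float      # t: seconds between heartbeats
    misses_allowed: int  # k: a peer is assumed down after more than k*t of silence
    peers: dict = field(default_factory=dict)  # machine_id -> (reboot_count, last_seen)

    def observe(self, datagram: bytes, now: float) -> bool:
        """Record a heartbeat; return True if the peer restarted since last seen."""
        msg = json.loads(datagram.decode("utf-8"))
        previous = self.peers.get(msg["id"])
        restarted = previous is not None and msg["reboots"] > previous[0]
        self.peers[msg["id"]] = (msg["reboots"], now)
        return restarted

    def down(self, now: float) -> list:
        """Machine IDs not heard from in more than k*t seconds."""
        limit = self.misses_allowed * self.interval
        return sorted(pid for pid, (_, seen) in self.peers.items() if now - seen > limit)
```

When `observe()` returns True, whatever state you held for that machine's previous incarnation (e.g. which slice of the input it had claimed) should be treated as gone, even though the machine never looked "down" by the timeout test.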

I'd recommend using MapReduce if it fits. It would save a lot of work.

无妨# 2024-08-12 12:59:46

I'm not sure this will answer the question, but you might be interested in the way WebLogic Server clustering works under the hood. From the book Mastering BEA WebLogic Server:

[...] WebLogic Server clustering provides a loose coupling of the servers in the cluster. Each server in the cluster is independent and does not rely on any other server for any fundamental operations. Even if contact with every other server is lost, each server will continue to run and be able to process the requests it receives. Each server in the cluster maintains its own list of other servers in the cluster through periodic heartbeat messages. Every 10 seconds, each server sends a heartbeat message to the other servers in the cluster to let them know it is still alive. Heartbeat messages are sent using IP multicast technology built into the JVM, making this mechanism efficient and scalable as the number of servers in the cluster gets large. Each server receives these heartbeat messages from other servers and uses them to maintain its current cluster membership list. If a server misses receiving three heartbeat messages in a row from any other server, it takes that server out of its membership list until it receives another heartbeat message from that server. This heartbeat technology allows servers to be dynamically added and dropped from the cluster with no impact on the existing servers’ configurations.
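The membership rule in that excerpt (drop a server after three consecutive missed heartbeats, re-add it on the next one) can be sketched roughly as below. This is not WebLogic's implementation, just the rule as described; the `MembershipList` class and its method names are invented for the example, and `end_of_interval()` is assumed to be called by a timer once per heartbeat period (10 seconds in the quote).

```python
class MembershipList:
    """Tracks cluster members by counting consecutive missed heartbeat intervals."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed
        self.missed = {}    # server name -> consecutive intervals without a heartbeat
        self.heard = set()  # servers heard from during the current interval

    def on_heartbeat(self, server: str) -> None:
        # Any heartbeat (re)adds the server and resets its miss counter.
        self.heard.add(server)
        self.missed[server] = 0

    def end_of_interval(self) -> None:
        # Called once per heartbeat period by a timer (assumed, not shown).
        for server in list(self.missed):
            if server not in self.heard:
                self.missed[server] += 1
                if self.missed[server] >= self.max_missed:
                    del self.missed[server]
        self.heard.clear()

    def members(self) -> list:
        return sorted(self.missed)
```

Because servers are added simply by being heard, nothing has to be reconfigured on the existing servers when one joins or leaves, which is the property the quote highlights.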

木格 2024-08-12 12:59:46

Cisco content switches are a hardware solution for this problem. They implement a virtual IP address as a front end to multiple real servers, whose real IP addresses are known to the switch. The switch periodically sends HTTP HEAD requests to the web servers, to verify they are still running (which the switch software calls a "keepalive", although this doesn't keep the server itself alive). The Cisco switch accepts traffic on the virtual IP and forwards it to the actual web servers, using configurable load balancing such as round-robin, or user-defined load balancing.
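For comparison, the same kind of HTTP HEAD probe is easy to approximate in software. The sketch below uses Python's standard `http.client`; the function name `is_alive`, the default path and timeout, and the choice to treat any non-5xx response as "up" are assumptions for the example and say nothing about how the Cisco switch itself decides.

```python
import http.client


def is_alive(host: str, port: int = 80, path: str = "/", timeout: float = 2.0) -> bool:
    """Send an HTTP HEAD request and report whether the server answered sanely."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        # Assumption: any status below 500 means the server is running.
        return conn.getresponse().status < 500
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts and malformed replies all count as down.
        return False
    finally:
        conn.close()
```

A probe like this only makes sense when the instances already expose an HTTP endpoint; for the application in the question a plain UDP heartbeat may be the lighter option.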

These switches retail in the $3-10K range, although my business partner picked one up on eBay for about $300 a year ago. If you can afford one, they do represent a proven hardware solution to the question of how to have a service spread transparently across multiple servers. Red Hat includes a built-in port configuration so that you could implement the equivalent of a Cisco content switch using a cheap Red Hat box. Google for "virtual ip address" and "cisco content router" for more information.

困倦 2024-08-12 12:59:46

In addition to trying hardware load balancers, you can also try a free, open-source software load balancer such as HAProxy, available for Linux and the BSDs.
