Are Erlang/OTP messages reliable? Can messages be duplicated?

Posted 2024-09-08 06:31:18

Long version:

I'm new to Erlang, and considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance.

However, I'm struggling to understand exactly how fault-tolerance is achieved in this system where messages are queued in transient memory. I understand that a supervisor hierarchy can be arranged to respawn deceased processes, but I've been unable to find much discussion of the implications of respawning on works-in-progress. What happens to in-flight messages and the artifacts of partially-completed work that were lost on a dying node?

Will all producers automatically retransmit messages that are not ack'd when consumer processes die? If not, how can this be considered fault-tolerant? And if so, what prevents a message that was processed -- but not quite acknowledged -- from being retransmitted, and hence reprocessed inappropriately?

(I recognize that these concerns are not unique to Erlang; similar concerns will arise in any distributed processing system. But Erlang enthusiasts seem to claim that the platform makes this all "easy"..?)

Assuming messages are retransmitted, I can easily envision a scenario where the downstream effects of a complex messaging chain could become very muddled after a fault. Without some sort of heavy distributed transaction system, I don't understand how consistency and correctness can be maintained without addressing duplication in every process. Must my application code always enforce constraints to prevent transactions from being executed more than once?

Short version:

Are distributed Erlang processes subject to duplicated messages? If so, is duplicate protection (i.e., idempotency) an application responsibility, or does Erlang/OTP somehow help us with this?

Comments (3)

已下线请稍等 2024-09-15 06:31:18

I'll separate this into points I hope will make sense. I might be re-hashing a bit of what I have written in The Hitchhiker's Guide to Concurrency. You might want to read that one to get details on the rationale behind the way message passing is done in Erlang.


1. Message transmission

Message passing in Erlang is done through asynchronous messages sent into mailboxes (a kind of queue for storing data). There is absolutely no assumption as to whether a message was received or not, or even that it was sent to a valid process. This is because it is plausible to assume [at a language level] that someone might want to handle a message only 4 days from now, and won't even acknowledge its existence until the process has reached a certain state.

A random example of this could be to imagine a long-running process that crunches data for 4 hours. Should it really acknowledge it received a message if it's unable to treat it? Maybe it should, maybe not. It really depends on your application. As such, no assumption is made. You can have half your messages asynchronous and only one that isn't.

Erlang expects you to send an acknowledgement message (and wait on it with a timeout) if you ever need it. The rules having to do with timing out and the format of the reply are left to the programmer to specify -- Erlang can't assume you want the acknowledgement on message reception, when a task is completed, whether it matches or not (the message could match in 4 hours when a new version of the code is hot-loaded), etc.
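
As a rough illustration of the acknowledgement-with-timeout idea, here is a minimal sketch. The module name ack_demo, the message shapes {job, From, Ref, Payload} and {ack, Ref}, and the choice to acknowledge on reception (rather than on completion) are all assumptions made for this example, not something Erlang prescribes.

```erlang
-module(ack_demo).
-export([start_worker/0, send_with_ack/3]).

%% Hypothetical worker: acknowledges each job as soon as it is received.
start_worker() ->
    spawn(fun worker_loop/0).

worker_loop() ->
    receive
        {job, From, Ref, _Payload} ->
            From ! {ack, Ref},          % here an ack means "received", not "done"
            worker_loop()
    end.

%% Send Payload to Pid and wait up to TimeoutMs for the matching ack.
send_with_ack(Pid, Payload, TimeoutMs) ->
    Ref = make_ref(),                   % unique tag, so unrelated acks are ignored
    Pid ! {job, self(), Ref, Payload},  % sending never fails, even to a dead pid
    receive
        {ack, Ref} -> ok
    after TimeoutMs ->
        {error, timeout}                % the caller decides: retry, give up, escalate
    end.
```

A timeout only tells the sender that no acknowledgement arrived in time; the job may still have been received. An ack that shows up after the timeout simply stays in the sender's mailbox until something matches it.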

To make it short, whether a message goes unread, fails to be received, or is interrupted by someone pulling the plug while it is in transit doesn't matter unless you want it to. If you want it to matter, you need to design that logic across processes.

The burden of implementing a high-level message protocol between Erlang processes is given to the programmer.


2. Message protocols

As you said, these messages are stored in transient memory: if a process dies, all the messages it hadn't read yet are lost. If you want more, there are various strategies. A few of them are:

  • Read the message as fast as possible and write it to disk if needed, send an acknowledgement back and process it later. Compare this to queue software such as RabbitMQ and ActiveMQ with persistent queues.
  • Use process groups to duplicate messages across a group of processes on multiple nodes. At this point you might enter transactional semantics. This approach is used by the Mnesia database for its transaction commits.
  • Don't assume anything has worked until you receive either an acknowledgement that everything went fine or a failure message.
  • A combination of process groups and failure messages. If a first process fails to handle a task (because the node goes down), a notification is automatically sent by the VM to a fail-over process which handles it instead. This method is sometimes used with full applications to handle hardware failures.

Depending on the task at hand, you might use one or many of these. They're all possible to implement in Erlang and in many cases modules are already written to do the heavy lifting for you.
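
The "acknowledgement or failure message" strategy can be sketched with a process monitor. The module name call_demo and the {request, ...} / {reply, ...} message shapes are assumptions for this example; only erlang:monitor/2, erlang:demonitor/2 and the 'DOWN' message format come from Erlang itself. OTP's gen_server:call works along similar lines.

```erlang
-module(call_demo).
-export([call/3]).

%% Returns {ok, Reply} | {error, {down, Reason}} | {error, timeout}.
call(Pid, Request, TimeoutMs) ->
    MRef = erlang:monitor(process, Pid),     % we get a 'DOWN' message if Pid dies
    Pid ! {request, self(), MRef, Request},  % the monitor ref doubles as request tag
    receive
        {reply, MRef, Reply} ->
            erlang:demonitor(MRef, [flush]), % drop the monitor and any queued 'DOWN'
            {ok, Reply};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, {down, Reason}}          % died (or never existed: noproc)
    after TimeoutMs ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

Note that {error, {down, Reason}} does not say whether the request was processed before the process died. Deciding whether it is safe to retry is still up to the protocol.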

So this might answer your question. Because you implement the protocols yourself, it's your choice whether messages get sent more than once or not.


3. What is fault-tolerance

Picking one of the above strategies does depend on what fault-tolerance means to you. In some cases, people mean it to say "no data is ever lost, no task ever fails." Other people use fault-tolerance to say "the user never sees a crash." In the case of Erlang systems, the usual meaning is about keeping the system running: it's alright to maybe have a single user dropping a phone call rather than having everyone dropping it.

Here the idea is then to let stuff that fails fail, but keep the rest running. To achieve this, there are a few things the VM gives you:

  • You can know when a process dies and why it did
  • You can force processes that depend on each other to die together if one of them goes wrong
  • You can run a logger that automatically logs every uncaught exception for you, and even define your own
  • Nodes can be monitored so you know when they went down (or got disconnected)
  • You can restart failed processes (or groups of failed processes)
  • You can have whole applications restart on a different node if one fails
  • And a lot more with the OTP framework

With these tools and a few of the standard library's modules handling different scenarios for you, you can implement pretty much what you want on top of Erlang's asynchronous semantics, although it usually pays to be able to use Erlang's definition of fault tolerance.
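
To make the first few points concrete, here is a minimal hand-rolled restart loop built on links and trap_exit. It is only a sketch: the module name restart_demo and the restart policy are made up for this example, and in real code you would normally use an OTP supervisor instead.

```erlang
-module(restart_demo).
-export([run/2]).

%% Run Fun in a linked process and restart it up to MaxRestarts times if it
%% exits abnormally. Returns ok on a normal exit, or {error, Reason} once the
%% restart budget is used up.
run(Fun, MaxRestarts) ->
    process_flag(trap_exit, true),   % exit signals now arrive as messages
                                     % (this changes the calling process's flag)
    supervise(Fun, MaxRestarts).

supervise(Fun, RestartsLeft) ->
    Pid = spawn_link(Fun),
    receive
        {'EXIT', Pid, normal} ->
            ok;
        {'EXIT', Pid, Reason} when RestartsLeft > 0 ->
            io:format("worker ~p died: ~p, restarting~n", [Pid, Reason]),
            supervise(Fun, RestartsLeft - 1);
        {'EXIT', Pid, Reason} ->
            {error, Reason}
    end.
```

The restarted worker starts from scratch with an empty mailbox. Whatever the previous incarnation had in memory, including unread messages, is gone, which is exactly why the message protocols above matter.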


4. A few notes

My personal opinion here is that it's pretty hard to have more assumptions than what exists in Erlang unless you want pure transactional semantics. One problem you'll always have trouble with is with nodes going down. You can never know if they went down because the server actually crashed or because the network failed.

In the case of a server crash, simply re-doing the tasks is easy enough. However, with a netsplit, you have to make sure some vital operations are neither done twice nor lost.

It usually boils down to the CAP theorem, which basically gives you 3 options, of which you have to pick two:

  1. Consistency
  2. Partition tolerance
  3. Availability

Depending on where you position yourself, different approaches will be needed. The CAP theorem is usually used to describe databases, but I believe similar questions are to be asked whenever you need some level of fault tolerance when processing data.

以可爱出名 2024-09-15 06:31:18

The Erlang/OTP system is fault tolerant. That doesn't relieve you of the need to build equally fault-tolerant apps in it. If you use Erlang and OTP, then there are a few things you can rely on.

  1. When a supervised process dies, its supervisor will restart it.
  2. For the most part, a process crashing won't bring down your whole app.
  3. When a message is sent it will be received provided the receiver exists.

As far as I know, messages in Erlang are not subject to duplication. If you send a message and the process receives it, then the message is gone from the queue. However, if you send a message and the process receives that message but crashes while processing it, then that message is gone and unhandled. That fact should be considered in the design of your system. OTP helps you handle all of this by using processes to isolate infrastructure-critical code (e.g. supervisors, gen_servers, ...) from application code that might be subject to crashes.

For instance, you might have a gen_server that dispatches work to a process pool. The processes in the pool might crash and get restarted. But the gen_server remains up, since its entire purpose is just to receive messages and dispatch them to the pool to work on. This allows the whole system to stay up despite errors and crashes in the pool, and there is always something waiting for your message.

Just because the system is fault tolerant doesn't mean your algorithm is.

柠檬色的秋千 2024-09-15 06:31:18

I think the answer has nothing to do with Erlang at all. It lies in the semantics of client-server interaction, where you can choose to implement "at least once", "at most once" or "exactly once" guarantees in your client-server protocol.
All of these invocation semantics can be implemented by combining unique tags, retries, and logging of client requests on both client and server before sending or executing them, so that a request can be picked up by the server after a crash.
Besides duplicates, you can also get lost, orphaned or delayed messages.
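
Since this thread is about Erlang, here is a minimal Erlang sketch of the "unique tags plus retries" combination: the client retries the same tagged request until it gets a reply, and the server applies each tag at most once and replays the earlier reply for duplicates. The module name dedup_demo, the {add, From, Tag, N} protocol, and the 500 ms retry interval are all assumptions for this example.

```erlang
-module(dedup_demo).
-export([start/0, request/4]).

%% Hypothetical counter server that applies each tagged request at most once.
start() ->
    spawn(fun() -> loop(0, #{}) end).

%% Seen maps each request tag to the reply that was already sent for it.
loop(Count, Seen) ->
    receive
        {add, From, Tag, N} ->
            case Seen of
                #{Tag := OldReply} ->
                    From ! {reply, Tag, OldReply},   % duplicate: replay, don't re-apply
                    loop(Count, Seen);
                _ ->
                    NewCount = Count + N,
                    From ! {reply, Tag, NewCount},
                    loop(NewCount, Seen#{Tag => NewCount})
            end
    end.

%% Resend the same tagged request until a reply arrives or retries run out.
request(_Pid, _Tag, _N, 0) ->
    {error, no_reply};
request(Pid, Tag, N, Retries) ->
    Pid ! {add, self(), Tag, N},
    receive
        {reply, Tag, Result} -> {ok, Result}
    after 500 ->
        request(Pid, Tag, N, Retries - 1)
    end.
```

In this sketch the Seen map lives only in the server's memory and grows without bound. To survive a server crash it would have to be logged somewhere durable, as described above, and old tags would need to be expired at some point.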
