Azure Service Bus sessions FIFO - how should a consumer handle processing errors?
Could you suggest how to handle consumer errors in an Azure Service Bus subscription that is set up to ensure FIFO processing using session IDs? (See https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sessions#first-in-first-out-fifo-pattern )
As an example, imagine a customer management system posting messages that are consumed by an accounting system. Every message uses the AccountID that owns the entity as its session ID, so that messages are received from the bus in FIFO order within the scope of each AccountID.
Imagine this message scenario:
- T1 - CreateAccount 1234
- T2 - AddCustomer 5678 to Account 1234
- T3 - RaiseInvoice for Customer 5678 on Account 1234
Suppose the consumer holds the session lock on AccountID=1234, takes a PeekLock on the AddCustomer message at T2, and then suffers a transient failure of the accounting system, so it cannot add Customer 5678. What should the consumer do?
If it dead-letters the AddCustomer message, it can't go on to process the RaiseInvoice message, since that will fail because Customer 5678 doesn't exist in the accounting system.
If it abandons the AddCustomer message, is it going to spin round a loop of AddCustomer->fail->abandon->AddCustomer until the max delivery count is reached and the message is dead-lettered anyway?
What should the consumer do here to safely respond to the issue?
See https://stackoverflow.com/a/53449282/491752 for confirmation of how the bus behaves. My question is: given knowledge of this problem, what should the consumer do?
Comments (1)
If it's a transient failure, you have a few options. One is to catch the exception yourself and retry the processing. This is what frameworks like Azure Functions, MassTransit, and NServiceBus do: they catch your exception and then call you again with the same message.
Very short-lived failure conditions might recover in that time.
The next option is to abandon the message on purpose. This puts it back on the queue and it will be redelivered, which increases the delivery count each time. The hope is that the transient failure resolves before the message reaches the max delivery count. If it doesn't, the message will be dead-lettered, and that's not ideal.
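To make the abandon loop concrete, here is a toy in-memory model of a single session. It is only a sketch and does not use the Azure SDK: `simulate_abandon_loop` and its parameters are made up for illustration, and the limit of 10 is assumed to match the default max delivery count (it is configurable per queue or subscription).

```python
from collections import deque

# Assumption: the default max delivery count; configurable on the entity.
MAX_DELIVERY_COUNT = 10


def simulate_abandon_loop(session_messages, failing_attempts):
    """Toy model of abandon/redeliver for one session (not the Azure SDK).

    failing_attempts is how many deliveries fail before the transient
    fault in the downstream system clears.
    Returns (processed, dead_lettered).
    """
    pending = deque(session_messages)
    processed, dead_lettered = [], []
    failures_left = failing_attempts
    delivery_count = 0
    while pending:
        delivery_count += 1  # every delivery of the head message counts
        if failures_left > 0:
            failures_left -= 1
            if delivery_count >= MAX_DELIVERY_COUNT:
                # Out of deliveries: the broker dead-letters the message.
                dead_lettered.append(pending.popleft())
                delivery_count = 0
            # Otherwise: abandoned, so it stays at the head of the session.
            continue
        processed.append(pending.popleft())
        delivery_count = 0
    return processed, dead_lettered
```

With `failing_attempts=3` both messages are eventually processed in order. With `failing_attempts=10` the AddCustomer message ends up dead-lettered and the session moves on to RaiseInvoice, which is exactly the situation the question is worried about.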
So what you could also do is tear down the whole consumer when a message processing error occurs. This would allow the session to be reallocated to another consumer, and the redelivery would go to that consumer instead; hopefully it would not hit the same error.
Basically, you need to retry and/or wait in some way until the transient condition passes. You could put exponential backoffs between your retries (the new client libraries should extend your lock automatically here), or add delays before you tear down a consumer.
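A minimal sketch of the retry-with-backoff idea, kept independent of any Service Bus client. `TransientError`, `process_with_backoff`, and the default delays are hypothetical names and values chosen for illustration; the total wait still has to fit inside the message/session lock, or the lock has to be renewed while waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def process_with_backoff(handler, message, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call handler(message), retrying on TransientError with backoff.

    Returns the number of attempts used. Re-raises the last error once
    attempts run out, so the caller can abandon the message or tear
    down the consumer.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

Because the message is retried in place rather than abandoned, the delivery count does not go up while the handler waits, and nothing later in the session (such as RaiseInvoice) is processed ahead of it.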
If by transient error you mean something that lasts an hour or more, you might need to monitor for errors and pause entire parts of the system (disable all consumers of a queue) until you've restored whatever is broken.
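One way to sketch the "monitor and pause" idea is a small circuit-breaker style gate that each consumer checks before it receives the next message. Everything here is hypothetical (`ConsumerPauseGate`, the threshold, the cool-down); in a real system the pause signal would more likely come from shared monitoring than from a per-process counter.

```python
import time


class ConsumerPauseGate:
    """Pauses receiving after N consecutive failures (illustrative only).

    After the cool-down has elapsed, one probe attempt is allowed; a
    further failure pauses again, a success resumes normal operation.
    """

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._paused_at = None

    def record_success(self):
        self._consecutive_failures = 0
        self._paused_at = None

    def record_failure(self):
        self._consecutive_failures += 1
        if self._consecutive_failures >= self._failure_threshold:
            # (Re)start the pause window from now.
            self._paused_at = self._clock()

    def may_receive(self):
        if self._paused_at is None:
            return True
        return self._clock() - self._paused_at >= self._cooldown
```

A consumer loop would call `may_receive()` before accepting a session or message, and `record_success()` / `record_failure()` after each processing attempt, so a long outage stops burning delivery counts on messages that cannot succeed yet.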
This kind of failure modeling is the meat of the challenge of building reliable systems. It's also sort of the fun part.