How stable is RabbitMQ in production (with DRBD and Pacemaker)?
Looking for experience with RabbitMQ, especially in an HA configuration using Pacemaker and DRBD as recommended here: http://www.rabbitmq.com/pacemaker.html
The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.
Comments (2)
It works most of the time. However, you'll have to pay special attention to fencing (split brain) when dealing with DRBD. On a production system it's always a pain to have to fix these kinds of issues manually.
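To make the split-brain point a bit more concrete, here is a minimal monitoring sketch in Python. It is not fencing (that is configured in DRBD and Pacemaker themselves); it only parses DRBD 8.x style `/proc/drbd` status text and flags resources whose connection state is `StandAlone`, which is the state a node typically ends up in after a split brain is detected. The function names and the sample text are hypothetical and for illustration only, and `StandAlone` can also mean an administrator disconnected the resource on purpose, so treat a hit as a hint to investigate rather than proof.

```python
import re
from typing import Dict, List

# Matches DRBD 8.x style per-resource status lines in /proc/drbd, e.g.
#  " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----"
_STATUS = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text: str) -> List[Dict[str, str]]:
    """Return one dict (minor, cs, ro, ds) per resource line found in text."""
    resources = []
    for line in text.splitlines():
        match = _STATUS.match(line)
        if match:
            resources.append(match.groupdict())
    return resources


def needs_attention(resource: Dict[str, str]) -> bool:
    """Flag a resource that has dropped its replication link entirely.

    Assumption: StandAlone is treated as suspicious. It is what a node
    usually shows after split brain, but it is not conclusive on its own.
    """
    return resource["cs"] == "StandAlone"


# Illustrative sample only, not captured from a real system.
SAMPLE = """version: 8.4.3 (api:1/proto:86-101)
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
"""

if __name__ == "__main__":
    for res in parse_drbd_status(SAMPLE):
        if needs_attention(res):
            print("check DRBD minor", res["minor"], "state", res["cs"])
```

On a real node you would read the text from `/proc/drbd` instead of `SAMPLE` and feed the result into whatever alerting you already have.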
We failed to run RabbitMQ in a master/slave setup (multi-state RA). We thought it would improve availability, but we're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity, it would be great if you could share it!
I find that the lack of tools for debugging Pacemaker when there are issues is a big hurdle to deploying it on live systems... It's not always clear what Pacemaker is "thinking" or doing. hb_report is not sufficient, unfortunately.
Hope this helps,
D.
We tried a master/slave configuration as well, but it became difficult to keep all instances up to date with no downtime. And trust me, you want to update RabbitMQ. There are always bugs popping up either in RabbitMQ itself or in Erlang.
We are getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has a generic "error while starting" in it and that's pretty much it. Sometimes it won't start after the crash, and most of those times the only solution is to delete all the persistent messages from all instances so that the queue state is synchronized across the cluster. Other times it crashes immediately after launching and only loads properly after multiple repeated attempts. This means there is no added reliability whatsoever when using master/slave. At least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the message somewhere in the logs or in the database, from where it can be quickly recovered after a major crash.
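As a rough illustration of the "keep a copy somewhere" idea, here is a minimal Python sketch that journals each message to SQLite before it is handed to the broker and replays whatever is still outstanding after a crash. Everything in it is an assumption for illustration: the `MessageJournal` class, the table layout, and the `publish` callable (which stands in for whatever client call you actually use to publish to RabbitMQ) are hypothetical, not something from our setup or from RabbitMQ itself.

```python
import sqlite3
from typing import Callable, List, Tuple


class MessageJournal:
    """Keeps a local copy of every message until it is marked as done."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" is only for demonstration; use a file path in practice.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS journal ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "routing_key TEXT NOT NULL, "
            "body BLOB NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def record(self, routing_key: str, body: bytes) -> int:
        """Store a message before publishing it; returns its journal id."""
        cur = self.db.execute(
            "INSERT INTO journal (routing_key, body) VALUES (?, ?)",
            (routing_key, body),
        )
        self.db.commit()
        return cur.lastrowid

    def mark_done(self, message_id: int) -> None:
        """Mark a message as fully processed so it is not replayed."""
        self.db.execute("UPDATE journal SET done = 1 WHERE id = ?", (message_id,))
        self.db.commit()

    def pending(self) -> List[Tuple[int, str, bytes]]:
        """Return all messages not yet marked done, oldest first."""
        rows = self.db.execute(
            "SELECT id, routing_key, body FROM journal WHERE done = 0 ORDER BY id"
        ).fetchall()
        return [(row[0], row[1], bytes(row[2])) for row in rows]

    def replay(self, publish: Callable[[str, bytes], None]) -> int:
        """Re-publish every pending message; returns how many were sent."""
        count = 0
        for _id, routing_key, body in self.pending():
            publish(routing_key, body)
            count += 1
        return count
```

After a major crash you would wipe the broker state if you have to, bring it back up, and call `replay` with your real publish function. Consumers need to tolerate duplicates, since a message may have been delivered before the crash but not yet marked done.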