MSMQ messages bound for a clustered MSMQ instance get stuck in the outgoing queues

Posted on 2024-09-26 12:33:58

We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.

More details:

We have a clustered MSMQ instance spanning servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as if they were local, i.e. the NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker apps on Services3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
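
To make that handshake concrete, here is a minimal sketch (Python with pywin32, not NServiceBus's actual API or wire format) of how a worker-side "ready" notification physically reaches a clustered control queue: the sender's local MSMQ service parks the message in an outgoing queue and forwards it to the cluster's MSMQ instance, which is exactly the hop that backs up in the screenshots below. The cluster network name and queue name are hypothetical placeholders.

```python
# Minimal sketch (not NServiceBus's real message format): send a labeled message
# to a clustered MSMQ control queue via the MSMQ COM API. The local MSMQ service
# stores it in an outgoing queue and forwards it to the cluster's MSMQ instance.
# Requires pywin32 on Windows.
import win32com.client

MQ_SEND_ACCESS = 2
MQ_DENY_NONE = 0

def send_to_control_queue(cluster_host: str, queue_name: str, label: str) -> None:
    qinfo = win32com.client.Dispatch("MSMQ.MSMQQueueInfo")
    # DIRECT=OS: addressing relies on resolving the cluster network name, which is
    # why flaky name resolution can strand messages in the local outgoing queue.
    qinfo.FormatName = f"DIRECT=OS:{cluster_host}\\private$\\{queue_name}"
    queue = qinfo.Open(MQ_SEND_ACCESS, MQ_DENY_NONE)
    try:
        msg = win32com.client.Dispatch("MSMQ.MSMQMessage")
        msg.Label = label
        msg.Body = "ready for work"  # placeholder payload
        msg.Send(queue)
    finally:
        queue.Close()

# Hypothetical cluster network name and control queue name:
# send_to_control_queue("MSMQCLUSTER1", "project1.distributor.control", "I'm Ready for Work")
```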

At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:

Clustered MSMQ Outgoing Queues in Hung State

If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:

Clustered MSMQ Outgoing Queues After Failover

Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?

3 Answers

╰つ倒转 2024-10-03 12:33:58

Over a year later, it seems that our issue has been resolved. The key takeaways seem to be:

  • Make sure you have a solid DNS system, so that when MSMQ needs to resolve a host, it can (a quick resolution check is sketched right after this list).
  • Only create one clustered instance of MSMQ on a Windows Failover Cluster.
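
As a rough illustration of the first takeaway, a check like the one below (plain Python, run from each worker box and cluster node) confirms that every name involved in the MSMQ traffic resolves. N1, N2, Services3, and Services4 are the machine names from the question; "MSMQCLUSTER1" stands in for the cluster network name and is a hypothetical placeholder.

```python
# Hedged sanity check: confirm every host involved in the MSMQ traffic resolves
# from this machine. Stale or failing DNS answers are one way messages end up
# parked in the local outgoing queue.
import socket

HOSTS = ["MSMQCLUSTER1", "N1", "N2", "Services3", "Services4"]  # MSMQCLUSTER1 is hypothetical

for host in HOSTS:
    try:
        canonical, _aliases, addresses = socket.gethostbyname_ex(host)
        print(f"{host:15} -> {addresses} (canonical: {canonical})")
    except socket.gaierror as err:
        print(f"{host:15} -> FAILED to resolve: {err}")
```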

When we set up our Windows Failover Cluster, we made the assumption that it would be bad to "waste" resources on the inactive node, and so, having two quasi-related NServiceBus clusters at the time, we made a clustered MSMQ instance for Project1, and another clustered MSMQ instance for Project2. Most of the time, we figured, we would run them on separate nodes, and during maintenance windows they would co-locate on the same node. After all, this was the setup we have for our primary and dev instances of SQL Server 2008, and that has been working quite well.

At some point I began to grow dubious about this approach, especially since failing over each MSMQ instance once or twice seemed to always get messages moving again.

I asked Udi Dahan (author of NServiceBus) about this clustered hosting strategy, and he gave me a puzzled expression and asked "Why would you want to do something like that?" In reality, the Distributor is very light-weight, so there's really not much reason to distribute them evenly among the available nodes.

After that, we decided to take everything we had learned and recreate a new Failover Cluster with only one MSMQ instance. We have not seen the issue since. Of course, proving that the problem is truly solved would mean proving a negative, which is impossible. It hasn't been an issue for at least six months, but who knows, I suppose it could fail tomorrow! Let's hope not.

被你宠の有点坏 2024-10-03 12:33:58

Maybe your servers were cloned and thus share the same Queue Manager ID (QMId).

MSMQ uses the QMId as a hash key for caching the addresses of remote machines. If more than one machine in your network has the same QMId, you can end up with stuck or missing messages.

Check out the explanation and solution in this blog post: Link
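
Since the linked post isn't reproduced here, one quick way to test this hypothesis is to read each machine's QMId and compare the values across servers; cloned machines report the same GUID. The registry path below is the commonly documented location for the queue manager ID, so treat it as an assumption and verify it for your Windows/MSMQ version.

```python
# Minimal sketch: read this machine's MSMQ queue manager ID (QMId) so it can be
# compared across servers. Requires Windows; run it on each node and worker box.
import uuid
import winreg

QMID_KEY = r"SOFTWARE\Microsoft\MSMQ\Parameters\MachineCache"  # commonly documented location

def read_qmid() -> uuid.UUID:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, QMID_KEY) as key:
        raw, _value_type = winreg.QueryValueEx(key, "QMId")  # REG_BINARY, 16-byte GUID
    return uuid.UUID(bytes_le=bytes(raw))

if __name__ == "__main__":
    print(f"QMId on this machine: {read_qmid()}")
```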

睡美人的小仙女 2024-10-03 12:33:58

How are your endpoints configured to persist their subscriptions?

What if one (or more) of your services encounters an error and is restarted by the Failover Cluster Manager? In that case, the service would never again receive one of the "I'm Ready for Work" messages from the other services.

When you fail over to the other node, I guess that all of your services send these messages again and, as a result, everything starts moving again.

To test this behavior, do the following.

  1. Stop and restart all your services.
  2. Stop only one of the services.
  3. Restart the stopped service.
  4. If your system does not hang, repeat this with each single service.

If your system now hangs again, check your configuration. In this scenario, at least one of your services (if not all of them) is losing its subscriptions between restarts. If you have not done so already, persist the subscriptions in a database.
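
To get some visibility while running the steps above, it can help to count the messages sitting in the distributor's clustered control and storage queues before and after restarting a single service. The sketch below is plain Python with pywin32, not an NServiceBus tool; the cluster host and queue names are hypothetical placeholders, and it assumes the account running it has peek permission on those queues.

```python
# Hedged sketch: count messages in a (possibly remote) MSMQ queue by walking a
# peek cursor, so the queue depth can be compared before and after restarting
# one worker service. Requires pywin32 on Windows; queue names are hypothetical.
import win32com.client

MQ_PEEK_ACCESS = 32
MQ_DENY_NONE = 0

def count_messages(format_name: str) -> int:
    qinfo = win32com.client.Dispatch("MSMQ.MSMQQueueInfo")
    qinfo.FormatName = format_name
    queue = qinfo.Open(MQ_PEEK_ACCESS, MQ_DENY_NONE)
    try:
        count = 0
        # Arguments are (WantDestinationQueue, WantBody, ReceiveTimeout); a 0 ms
        # timeout returns None instead of blocking when no further message exists.
        msg = queue.PeekCurrent(False, False, 0)
        while msg is not None:
            count += 1
            msg = queue.PeekNext(False, False, 0)
        return count
    finally:
        queue.Close()

# Hypothetical cluster network name and queue names:
# print(count_messages(r"DIRECT=OS:MSMQCLUSTER1\private$\project1.distributor.control"))
# print(count_messages(r"DIRECT=OS:MSMQCLUSTER1\private$\project1.distributor.storage"))
```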
