JGroups eating memory

I currently have a problem with my JGroups configuration that causes thousands of messages to get stuck in the NAKACK.xmit_table. Actually, all of them seem to end up in the xmit_table, and another dump from a few hours later indicates that they never intend to leave either...

This is the protocol stack configuration

UDP(bind_addr=xxx.xxx.xxx.114;
bind_interface=bond0;
ip_mcast=true;ip_ttl=64;
loopback=false;
mcast_addr=228.1.2.80;mcast_port=45589;
mcast_recv_buf_size=80000;
mcast_send_buf_size=150000;
ucast_recv_buf_size=80000;
ucast_send_buf_size=150000):
PING(num_initial_members=3;timeout=2000):
MERGE2(max_interval=20000;min_interval=10000):
FD_SOCK:
FD(max_tries=5;shun=true;timeout=10000):
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(discard_delivered_msgs=true;gc_lag=50;retransmit_timeout=600,1200,2400,4800;use_mcast_xmit=true):
pbcast.STABLE(desired_avg_gossip=20000;max_bytes=400000;stability_delay=1000):UNICAST(timeout=600,1200,2400):
FRAG(frag_size=8192):pbcast.GMS(join_timeout=5000;print_local_addr=true;shun=true):
pbcast.STATE_TRANSFER
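
For reference, this colon-separated property string is the old-style configuration format that the JGroups 2.x JChannel(String) constructor accepts directly. Below is a minimal sketch of how such a stack is wired up; the abbreviated property string and the cluster name are illustrative, not the production values.

import org.jgroups.JChannel;

public class StackSetup {
    public static void main(String[] args) throws Exception {
        // Abbreviated old-style property string in the same format as the full
        // UDP:PING:...:pbcast.STATE_TRANSFER stack shown above (illustrative values).
        String props =
            "UDP(mcast_addr=228.1.2.80;mcast_port=45589;ip_ttl=64):" +
            "PING(num_initial_members=3;timeout=2000):" +
            "pbcast.NAKACK(retransmit_timeout=600,1200,2400,4800):" +
            "UNICAST(timeout=600,1200,2400):" +
            "pbcast.STABLE(desired_avg_gossip=20000):" +
            "pbcast.GMS(join_timeout=5000;print_local_addr=true)";

        JChannel channel = new JChannel(props);  // String-based config, JGroups 2.x
        channel.connect("demo-cluster");         // hypothetical cluster name
        System.out.println("Current view: " + channel.getView());
        channel.close();
    }
}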

Startup message...

2010-03-01 23:40:05,358 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [xxx.xxx.xxx.35:51723|17] [xxx.xxx.xxx.35:51723, xxx.xxx.xxx.36:53088, xxx.xxx.xxx.115:32781, xxx.xxx.xxx.114:32934]
2010-03-01 23:40:05,363 INFO  [org.jboss.cache.TreeCache] TreeCache local address is 10.35.191.114:32934
2010-03-01 23:40:05,393 INFO  [org.jboss.cache.TreeCache] received the state (size=32768 bytes)
2010-03-01 23:40:05,509 INFO  [org.jboss.cache.TreeCache] state was retrieved successfully (in 146 milliseconds)

... indicates that everything is fine so far.

The logs, set to warn level, do not indicate that anything is wrong, except for the occasional

2010-03-03 09:59:01,354 ERROR [org.jgroups.blocks.NotificationBus] exception=java.lang.IllegalArgumentException: java.lang.NullPointerException

which I'm guessing is unrelated, since it has been seen before without the memory issue.

I have been digging through two memory dumps from one of the machines looking for oddities, but nothing so far, except for maybe some statistics from the different protocols.

UDP has

num_bytes_sent 53617832
num_bytes_received 679220174
num_messages_sent 99524
num_messages_received 99522

while NAKACK has...

num_bytes_sent 0
num_bytes_received 0
num_messages_sent 0
num_messages_received 0

... and a huge xmit_table.
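
The counters above came out of heap dumps, but they can also be read from a running channel. A hedged sketch, assuming a JGroups 2.x release where JChannel.dumpStats() is available and statistics are enabled on the protocols; the cluster name is illustrative.

import java.util.Map;
import org.jgroups.JChannel;

public class ProtocolStats {
    public static void main(String[] args) throws Exception {
        // Default stack for brevity; in the scenario above this would be the full
        // UDP:...:pbcast.STATE_TRANSFER property string from the question.
        JChannel channel = new JChannel();
        channel.connect("stats-demo");  // hypothetical cluster name

        // dumpStats() (assumed available in this JGroups line) returns a map of
        // protocol name -> statistics, which is where counters such as
        // num_messages_sent and num_bytes_received come from. Comparing the UDP
        // entry with the NAKACK entry would show the same mismatch as above.
        Map<?, ?> stats = channel.dumpStats();
        for (Map.Entry<?, ?> e : stats.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        channel.close();
    }
}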

Each machine has two JChannel instances, one for ehcache and one for TreeCache. A misconfiguration means that both of them share the same diagnostics mcast address, but this should not pose a problem unless I want to send diagnostics messages, right? However, they do of course have different mcast addresses for the messages.

Please ask for clarifications, I have lots of information but I'm a bit uncertain about what is relevant at this point.


Comments (1)

遇到 2024-08-30 17:17:58

It turns out that one of the nodes in the cluster did not receive any multicast messages at all. This caused all the nodes to hang on to their own xmit_tables, since they never got any stability messages from the 'isolated' node confirming that it had received their messages.

Restarting the application servers and changing the multicast address solved the issue.
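
If a node is silently not receiving multicast traffic like this, a quick way to confirm it independently of JGroups is a bare multicast send/receive check on the cluster's group address and port. A minimal sketch using only the JDK; the group and port mirror the mcast_addr/mcast_port from the configuration above, and running the receiver on the suspect node while sending from another node shows whether datagrams make it through. (JGroups also ships its own multicast test programs, org.jgroups.tests.McastReceiverTest and McastSenderTest, which do essentially the same thing.)

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class McastCheck {
    public static void main(String[] args) throws Exception {
        // Same group/port as mcast_addr/mcast_port in the stack above.
        InetAddress group = InetAddress.getByName("228.1.2.80");
        int port = 45589;

        if (args.length > 0 && "send".equals(args[0])) {
            // Sender: fire a single datagram at the group.
            MulticastSocket sock = new MulticastSocket();
            byte[] payload = "ping".getBytes();
            sock.send(new DatagramPacket(payload, payload.length, group, port));
            sock.close();
        } else {
            // Receiver: block until a datagram arrives. If nothing ever shows up on
            // the suspect node while the other nodes do receive it, multicast is
            // broken between them (routing, switch IGMP snooping, firewall, ...).
            MulticastSocket sock = new MulticastSocket(port);
            sock.joinGroup(group);
            byte[] buf = new byte[1024];
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            sock.receive(pkt);
            System.out.println("Received '" + new String(pkt.getData(), 0, pkt.getLength())
                    + "' from " + pkt.getAddress());
            sock.close();
        }
    }
}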
