JGroups eating memory
I currently have a problem with my JGroups configuration that causes thousands of messages to get stuck in the NAKACK.xmit_table. Actually, all of them seem to end up in the xmit_table, and another dump from a few hours later indicates that they never intend to leave either...
This is the protocol stack configuration
UDP(bind_addr=xxx.xxx.xxx.114;
bind_interface=bond0;
ip_mcast=true;ip_ttl=64;
loopback=false;
mcast_addr=228.1.2.80;mcast_port=45589;
mcast_recv_buf_size=80000;
mcast_send_buf_size=150000;
ucast_recv_buf_size=80000;
ucast_send_buf_size=150000):
PING(num_initial_members=3;timeout=2000):
MERGE2(max_interval=20000;min_interval=10000):
FD_SOCK:
FD(max_tries=5;shun=true;timeout=10000):
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(discard_delivered_msgs=true;gc_lag=50;retransmit_timeout=600,1200,2400,4800;use_mcast_xmit=true):
pbcast.STABLE(desired_avg_gossip=20000;max_bytes=400000;stability_delay=1000):UNICAST(timeout=600,1200,2400):
FRAG(frag_size=8192):pbcast.GMS(join_timeout=5000;print_local_addr=true;shun=true):
pbcast.STATE_TRANSFER
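(For reference, the whole string above is the plain-text properties string the channel is built from; below is a minimal sketch of feeding it to a channel directly, assuming the JGroups 2.x API. In our setup the channels are actually created by TreeCache and ehcache, and the cluster name used here is just a placeholder.)

import org.jgroups.JChannel;

public class StackSketch {
    public static void main(String[] args) throws Exception {
        // args[0] would be the full UDP(...):PING(...):... stack shown above, on one line
        JChannel channel = new JChannel(args[0]);  // parses the plain-text protocol stack
        channel.connect("tree-cache-cluster");     // placeholder cluster name
        System.out.println("local address: " + channel.getLocalAddress());
        channel.close();
    }
}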
Startup message...
2010-03-01 23:40:05,358 INFO [org.jboss.cache.TreeCache] viewAccepted(): [xxx.xxx.xxx.35:51723|17] [xxx.xxx.xxx.35:51723, xxx.xxx.xxx.36:53088, xxx.xxx.xxx.115:32781, xxx.xxx.xxx.114:32934]
2010-03-01 23:40:05,363 INFO [org.jboss.cache.TreeCache] TreeCache local address is 10.35.191.114:32934
2010-03-01 23:40:05,393 INFO [org.jboss.cache.TreeCache] received the state (size=32768 bytes)
2010-03-01 23:40:05,509 INFO [org.jboss.cache.TreeCache] state was retrieved successfully (in 146 milliseconds)
... indicates that everything is fine so far.
The logs, set to warn level, do not indicate that anything is wrong, except for the occasional
2010-03-03 09:59:01,354 ERROR [org.jgroups.blocks.NotificationBus] exception=java.lang.IllegalArgumentException: java.lang.NullPointerException
which I'm guessing is unrelated, since it has been seen before without the memory issue.
I have been digging through two memory dumps from one of the machines looking for oddities, but nothing so far, except maybe for some statistics from the different protocols.
UDP has
num_bytes_sent 53617832
num_bytes_received 679220174
num_messages_sent 99524
num_messages_received 99522
while NAKACK has...
num_bytes_sent 0
num_bytes_received 0
num_messages_sent 0
num_messages_received 0
... and a huge xmit_table.
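As an aside, the same per-protocol counters can usually be read from the live channel instead of a heap dump. A rough sketch, assuming the JGroups version in use exposes JChannel.dumpStats() (the class and helper names around it are mine):

import java.util.Map;
import org.jgroups.JChannel;

public class ProtocolStats {
    // "channel" is assumed to be the already-connected JChannel behind TreeCache.
    public static void printStats(JChannel channel) {
        Map<?, ?> stats = channel.dumpStats();   // protocol name -> map of its counters
        for (Map.Entry<?, ?> e : stats.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}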
Each machine has two JChannel instances, one for ehcache and one for TreeCache. A misconfiguration means that both of them share the same diagnostics mcast address, but that should not pose a problem unless I want to send diagnostics messages, right? They do, of course, have different mcast addresses for the actual messages.
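For what it is worth, here is a rough sketch of how the two channels could be kept fully apart, assuming the transport in this JGroups version accepts the enable_diagnostics / diagnostics_addr attributes (names may differ between releases; all addresses below are made up for illustration):

public class ChannelAddresses {
    // Each channel gets its own message mcast_addr/port and its own diagnostics
    // address, so the ehcache and TreeCache clusters share nothing. Only the UDP
    // line differs; the rest of each stack stays as in the configuration above.
    static final String EHCACHE_UDP =
        "UDP(mcast_addr=228.1.2.80;mcast_port=45589;" +
        "enable_diagnostics=true;diagnostics_addr=224.0.75.75)";
    static final String TREECACHE_UDP =
        "UDP(mcast_addr=228.1.2.81;mcast_port=45590;" +
        "enable_diagnostics=true;diagnostics_addr=224.0.75.76)";
}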
Please ask for clarifications; I have lots of information, but at this point I'm a bit uncertain about what is relevant.
It turns out that one of the nodes in the cluster did not receive any multicast messages at all. This caused all the nodes to hold on to their own xmit_tables, since they never got any stability messages from the 'isolated' node confirming that it had received their messages.
Restarting the application servers and changing the multicast address solved the issue.
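In case it helps anyone hitting the same thing: a quick way to confirm that every node actually receives multicast on the cluster address is a bare-bones receiver like the sketch below (a rough illustration only; the address and port are taken from the UDP config in the question, and on multi-homed hosts like ours with bond0 you may need to bind it to the right interface). JGroups also ships small test programs for this purpose, org.jgroups.tests.McastSenderTest and McastReceiverTest, at least in the versions I have seen.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// Run this on every node while another node sends to 228.1.2.80:45589.
// A node that never prints anything is not receiving multicast, which is
// exactly the situation described above.
public class McastCheck {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("228.1.2.80");
        MulticastSocket socket = new MulticastSocket(45589);
        socket.joinGroup(group);                  // subscribe to the multicast group
        byte[] buf = new byte[8192];
        while (true) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet);               // blocks until a datagram arrives
            System.out.println("received " + packet.getLength()
                    + " bytes from " + packet.getAddress());
        }
    }
}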