JGroups eating memory
I currently have a problem with my JGroups configuration that causes thousands of messages to get stuck in the NAKACK.xmit_table. Actually, all of them seem to end up in the xmit_table, and another dump from a few hours later indicates that they never intend to leave either...
This is the protocol stack configuration
UDP(bind_addr=xxx.xxx.xxx.114;
bind_interface=bond0;
ip_mcast=true;ip_ttl=64;
loopback=false;
mcast_addr=228.1.2.80;mcast_port=45589;
mcast_recv_buf_size=80000;
mcast_send_buf_size=150000;
ucast_recv_buf_size=80000;
ucast_send_buf_size=150000):
PING(num_initial_members=3;timeout=2000):
MERGE2(max_interval=20000;min_interval=10000):
FD_SOCK:
FD(max_tries=5;shun=true;timeout=10000):
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(discard_delivered_msgs=true;gc_lag=50;retransmit_timeout=600,1200,2400,4800;use_mcast_xmit=true):
pbcast.STABLE(desired_avg_gossip=20000;max_bytes=400000;stability_delay=1000):UNICAST(timeout=600,1200,2400):
FRAG(frag_size=8192):pbcast.GMS(join_timeout=5000;print_local_addr=true;shun=true):
pbcast.STATE_TRANSFER
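(For reference, the whole string above is the plain-text properties string the channel is built from; below is a minimal sketch of feeding it to a channel directly, assuming the JGroups 2.x API. In our setup the channels are actually created by TreeCache and ehcache, and the cluster name used here is just a placeholder.)

import org.jgroups.JChannel;

public class StackSketch {
    public static void main(String[] args) throws Exception {
        // args[0] would be the full UDP(...):PING(...):... stack shown above, on one line
        JChannel channel = new JChannel(args[0]);  // parses the plain-text protocol stack
        channel.connect("tree-cache-cluster");     // placeholder cluster name
        System.out.println("local address: " + channel.getLocalAddress());
        channel.close();
    }
}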
Startup message...
2010-03-01 23:40:05,358 INFO [org.jboss.cache.TreeCache] viewAccepted(): [xxx.xxx.xxx.35:51723|17] [xxx.xxx.xxx.35:51723, xxx.xxx.xxx.36:53088, xxx.xxx.xxx.115:32781, xxx.xxx.xxx.114:32934]
2010-03-01 23:40:05,363 INFO [org.jboss.cache.TreeCache] TreeCache local address is 10.35.191.114:32934
2010-03-01 23:40:05,393 INFO [org.jboss.cache.TreeCache] received the state (size=32768 bytes)
2010-03-01 23:40:05,509 INFO [org.jboss.cache.TreeCache] state was retrieved successfully (in 146 milliseconds)
... indicates that everything is fine so far.
The logs, set to warn level, do not indicate that anything is wrong, except for the occasional
2010-03-03 09:59:01,354 ERROR [org.jgroups.blocks.NotificationBus] exception=java.lang.IllegalArgumentException: java.lang.NullPointerException
which I'm guessing is unrelated, since it has been seen before without the memory issue.
I have been digging through two memory dumps from one of the machines looking for oddities, but nothing so far, except maybe for some statistics from the different protocols.
UDP has
num_bytes_sent 53617832
num_bytes_received 679220174
num_messages_sent 99524
num_messages_received 99522
while NAKACK has...
num_bytes_sent 0
num_bytes_received 0
num_messages_sent 0
num_messages_received 0
... and a huge xmit_table.
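As an aside, the same per-protocol counters can usually be read from the live channel instead of a heap dump. A rough sketch, assuming the JGroups version in use exposes JChannel.dumpStats() (the class and helper names around it are mine):

import java.util.Map;
import org.jgroups.JChannel;

public class ProtocolStats {
    // "channel" is assumed to be the already-connected JChannel behind TreeCache.
    public static void printStats(JChannel channel) {
        Map<?, ?> stats = channel.dumpStats();   // protocol name -> map of its counters
        for (Map.Entry<?, ?> e : stats.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}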
Each machine has two JChannel instances, one for ehcache and one for TreeCache. A misconfiguration means that both of them share the same diagnostics mcast address, but that should not pose a problem unless I want to send diagnostics messages, right? They do, of course, have different mcast addresses for the actual messages.
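For what it is worth, here is a rough sketch of how the two channels could be kept fully apart, assuming the transport in this JGroups version accepts the enable_diagnostics / diagnostics_addr attributes (names may differ between releases; all addresses below are made up for illustration):

public class ChannelAddresses {
    // Each channel gets its own message mcast_addr/port and its own diagnostics
    // address, so the ehcache and TreeCache clusters share nothing. Only the UDP
    // line differs; the rest of each stack stays as in the configuration above.
    static final String EHCACHE_UDP =
        "UDP(mcast_addr=228.1.2.80;mcast_port=45589;" +
        "enable_diagnostics=true;diagnostics_addr=224.0.75.75)";
    static final String TREECACHE_UDP =
        "UDP(mcast_addr=228.1.2.81;mcast_port=45590;" +
        "enable_diagnostics=true;diagnostics_addr=224.0.75.76)";
}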
Please ask for clarifications; I have lots of information, but at this point I'm a bit uncertain about what is relevant.
It turns out that one of the nodes in the cluster did not receive any multicast messages at all. This caused all the nodes to hold on to their own xmit_tables, since they never got any stability messages from the 'isolated' node confirming that it had received their messages.
Restarting the application servers and changing the multicast address solved the issue.
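In case it helps anyone hitting the same thing: a quick way to confirm that every node actually receives multicast on the cluster address is a bare-bones receiver like the sketch below (a rough illustration only; the address and port are taken from the UDP config in the question, and on multi-homed hosts like ours with bond0 you may need to bind it to the right interface). JGroups also ships small test programs for this purpose, org.jgroups.tests.McastSenderTest and McastReceiverTest, at least in the versions I have seen.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// Run this on every node while another node sends to 228.1.2.80:45589.
// A node that never prints anything is not receiving multicast, which is
// exactly the situation described above.
public class McastCheck {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("228.1.2.80");
        MulticastSocket socket = new MulticastSocket(45589);
        socket.joinGroup(group);                  // subscribe to the multicast group
        byte[] buf = new byte[8192];
        while (true) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet);               // blocks until a datagram arrives
            System.out.println("received " + packet.getLength()
                    + " bytes from " + packet.getAddress());
        }
    }
}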