Flink cluster with ZooKeeper HA keeps shutting down: [RECEIVED SIGNAL 15: SIGTERM]
Environment:
Flink 1.14.4, standalone application mode on Kubernetes
Deployed following the steps in the official documentation
Problem:
The JobManager shuts down and restarts every three minutes, then exits again three minutes later
- there are no timer tasks; the program logic is just a simple word count
- the problem occurs every three minutes whether the cluster is processing input or sitting idle
- the JobManager does not have this problem when running without ZooKeeper HA
Question:
Why does the JobManager keep shutting down when ZooKeeper HA is enabled, and how can it be fixed?
I used the same steps and YAML manifests as the official website, so I have no idea what causes this.
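For reference, the ZooKeeper HA entries in flink-conf.yaml follow the official guide; a rough sketch is below, where the quorum address and storage path are placeholders rather than the exact values from my cluster:

# enable ZooKeeper-based high availability
high-availability: zookeeper
# placeholder ZooKeeper address
high-availability.zookeeper.quorum: zookeeper-service:2181
# placeholder path on shared storage (e.g. an NFS mount)
high-availability.storageDir: file:///flink/ha/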
Code:
It is just a word count; other programs have the same problem.
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// HOST and PORT are constants defined elsewhere in the class
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> dataStreamSource = executionEnvironment.socketTextStream(HOST, PORT);
    // split lines into (word, 1) tuples, key by the word, and sum the counts
    DataStream<Tuple2<String, Integer>> sum = dataStreamSource.flatMap(new WordCount.MyFlatMapper()).keyBy(0).sum(1);
    sum.print();
    executionEnvironment.execute();
}
JobManager pod restarts and exits:
NAMESPACE NAME READY STATUS RESTARTS AGE
default flink-jobmanager-8jn6x 1/1 Running 1 (118s ago) 5m38s
default flink-jobmanager-8jn6x 1/1 Running 2 (106s ago) 8m26s
default flink-jobmanager-8jn6x 1/1 Running 3 (1s ago) 9m41s
default flink-jobmanager-8jn6x 1/1 Running 4 (1s ago) 12m
default flink-jobmanager-8jn6x 1/1 Running 5 (0s ago) 15m
default flink-jobmanager-8jn6x 1/1 Running 6 (1s ago) 18m
default flink-jobmanager-8jn6x 1/1 Terminating 6 (1s ago) 18m
default flink-jobmanager-8jn6x 1/1 Terminating 6 (1s ago) 18m
JobManager logs:
-1--
2022-04-23 09:48:21,970 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 33 (type=CHECKPOINT) @ 1650707301963 for job 00000000000000000000000000000000.
2022-04-23 09:48:22,010 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 33 for job 00000000000000000000000000000000 (4917 bytes, checkpointDuration=23 ms, finalizationTime=24 ms).
2022-04-23 09:48:26,627 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:48:26,795 WARN akka.actor.CoordinatedShutdown [] - Could not addJvmShutdownHook, due to: Shutdown in progress
2022-04-23 09:48:26,822 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:48:26,824 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:48:26,824 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:48:26,838 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:48:26,887 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
2022-04-23 09:48:26,894 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
---
-2--
2022-04-23 09:51:24,903 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 67 (type=CHECKPOINT) @ 1650707484897 for job 00000000000000000000000000000000.
2022-04-23 09:51:24,943 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 67 for job 00000000000000000000000000000000 (4982 bytes, checkpointDuration=21 ms, finalizationTime=25 ms).
2022-04-23 09:51:26,626 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:51:26,840 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:51:26,845 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:51:26,847 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:51:26,848 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:51:26,871 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
---
-3--
2022-04-23 09:54:26,625 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:54:26,838 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:54:26,840 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
---
-4--
2022-04-23 09:57:26,627 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:57:26,632 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
2022-04-23 09:57:26,812 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 09:57:26,812 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
---
-5--
2022-04-23 10:00:26,625 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 10:00:26,859 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2022-04-23 10:00:26,859 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 10:00:26,884 WARN akka.actor.CoordinatedShutdown [] - Could not addJvmShutdownHook, due to: Shutdown in progress
---
- Update 2022/04/30 -
Debug log: https://www.mediafire.com/file/3q8vpzqfnmohgng/debug.log/file
That's all!
Answers (1)
I had the same situation as you. I checked the k8s logs and found that the restarts were caused by the livenessProbe configured on the JobManager. Because the livenessProbe kept failing, k8s restarted the pod (the liveness probe is checked once a minute and the pod is restarted after three failures, so it restarts every three minutes). If you want to work around the problem temporarily, you can disable the liveness probe, but the liveness probe is important for HA, and I have not found a good solution yet.
[enter image description here][1]
[1]: https://i.sstatic.net/Xizk2.png
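For illustration, here is a minimal sketch of the probe section in question, assuming the jobmanager Deployment from the official Flink example (the port and timing values below come from that example and may differ in your manifest); deleting or relaxing it is the temporary workaround described above:

# excerpt from the jobmanager Deployment container spec
livenessProbe:
  tcpSocket:
    port: 6123            # JobManager RPC port
  initialDelaySeconds: 30
  periodSeconds: 60       # checked once a minute
  # failureThreshold defaults to 3, so the pod is killed after ~3 minutes of failed probes
  # temporary workaround: remove this probe, or relax it, e.g.
  # failureThreshold: 10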