Flink cluster with ZooKeeper HA always shuts down: [RECEIVED SIGNAL 15: SIGTERM]

Posted on 2025-01-23 18:01:38 | Word count: 7314 | Views: 3 | Comments: 0


Environment:

Flink 1.14.4
Standalone application mode on Kubernetes

Set up according to the official steps:

Flink cluster: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/standalone/kubernetes/#application-mode

ZooKeeper HA: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/ha/zookeeper_ha/

The problem:

The JobManager shuts down and restarts every three minutes, and the pod eventually quits.

-- There is no timer task; the program logic is just a simple word count.

-- The problem also occurs every three minutes when the cluster is idle, with no input and nothing to do.

-- Without ZooKeeper HA, the JobManager does not have this problem.

The question:

Why does the JobManager always shut down when ZooKeeper HA is enabled, and how can I fix it?

I used the same steps and YAML as the official site, so I have no idea what causes this.

The code:

It is just a word count; other programs show the same problem.

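public static void main(String[] args) throws Exception {
    // HOST, PORT and WordCount.MyFlatMapper are not shown here.
    StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();

    // Read lines of text from a socket.
    DataStreamSource<String> dataStreamSource = executionEnvironment.socketTextStream(HOST, PORT);

    // Apply the flat mapper, key by field 0 (presumably the word) and sum field 1 (the count).
    DataStream<Tuple2<String, Integer>> sum = dataStreamSource.flatMap(new WordCount.MyFlatMapper()).keyBy(0).sum(1);

    // Print the running result to stdout.
    sum.print();

    executionEnvironment.execute();
}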

JobManager pod restarts and then quits:


NAMESPACE   NAME                     READY   STATUS        RESTARTS       AGE
default     flink-jobmanager-8jn6x   1/1     Running       1 (118s ago)   5m38s
default     flink-jobmanager-8jn6x   1/1     Running       2 (106s ago)   8m26s
default     flink-jobmanager-8jn6x   1/1     Running       3 (1s ago)     9m41s
default     flink-jobmanager-8jn6x   1/1     Running       4 (1s ago)     12m
default     flink-jobmanager-8jn6x   1/1     Running       5 (0s ago)     15m
default     flink-jobmanager-8jn6x   1/1     Running       6 (1s ago)     18m
default     flink-jobmanager-8jn6x   1/1     Terminating   6 (1s ago)     18m
default     flink-jobmanager-8jn6x   1/1     Terminating   6 (1s ago)     18m

JobManager logs:


-1--
2022-04-23 09:48:21,970 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 33 (type=CHECKPOINT) @ 1650707301963 for job 00000000000000000000000000000000.
2022-04-23 09:48:22,010 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 33 for job 00000000000000000000000000000000 (4917 bytes, checkpointDuration=23 ms, finalizationTime=24 ms).
2022-04-23 09:48:26,627 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:48:26,795 WARN  akka.actor.CoordinatedShutdown                               [] - Could not addJvmShutdownHook, due to: Shutdown in progress
2022-04-23 09:48:26,822 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:48:26,824 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:48:26,824 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:48:26,838 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:48:26,887 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut down.
2022-04-23 09:48:26,894 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut down.
---
-2--
2022-04-23 09:51:24,903 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 67 (type=CHECKPOINT) @ 1650707484897 for job 00000000000000000000000000000000.
2022-04-23 09:51:24,943 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 67 for job 00000000000000000000000000000000 (4982 bytes, checkpointDuration=21 ms, finalizationTime=25 ms).
2022-04-23 09:51:26,626 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:51:26,840 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:51:26,845 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:51:26,847 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:51:26,848 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 09:51:26,871 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut down.
---
-3--
2022-04-23 09:54:26,625 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:54:26,838 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:54:26,840 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
[root@master 02-logger--ckps-nfs-reactive-hpa-zk]#
---
-4--
2022-04-23 09:57:26,627 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 09:57:26,632 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124
2022-04-23 09:57:26,812 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 09:57:26,812 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
---
-5--
2022-04-23 10:00:26,625 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-04-23 10:00:26,859 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Shutting down remote daemon.
2022-04-23 10:00:26,859 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remote daemon shut down; proceeding with flushing remote transports.
2022-04-23 10:00:26,884 WARN  akka.actor.CoordinatedShutdown                               [] - Could not addJvmShutdownHook, due to: Shutdown in progress
---
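
The SIGTERM timestamps in the five log excerpts above can be compared directly to confirm how regular the shutdowns are. The following is a minimal sketch, not part of the original setup: the class name `SigtermIntervals` and the helper `gapsInSeconds` are made up for illustration, and the timestamps are copied by hand from the `RECEIVED SIGNAL 15: SIGTERM` lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class SigtermIntervals {
    // Timestamp layout of the JobManager log lines, e.g. "2022-04-23 09:48:26,627".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the gap between consecutive timestamps, rounded to whole seconds.
    static List<Long> gapsInSeconds(List<String> timestamps) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < timestamps.size(); i++) {
            LocalDateTime previous = LocalDateTime.parse(timestamps.get(i - 1), LOG_TIME);
            LocalDateTime current = LocalDateTime.parse(timestamps.get(i), LOG_TIME);
            gaps.add(Math.round(Duration.between(previous, current).toMillis() / 1000.0));
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Timestamps of the SIGTERM lines in excerpts 1 to 5.
        List<String> sigtermTimes = List.of(
                "2022-04-23 09:48:26,627",
                "2022-04-23 09:51:26,626",
                "2022-04-23 09:54:26,625",
                "2022-04-23 09:57:26,627",
                "2022-04-23 10:00:26,625");
        System.out.println(gapsInSeconds(sigtermTimes)); // prints [180, 180, 180, 180]
    }
}
```

Every gap is 180 seconds to within a few milliseconds, which suggests an external periodic trigger rather than something inside the job itself.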

-- updated 2022/04/30 --

Debug logs:
https://www.mediafire.com/file/3q8vpzqfnmohgng/debug.log/file

Thanks, all!


Comments (1)

女中豪杰 2025-01-30 18:01:38


I had the same situation as you. I checked the Kubernetes logs and found that it is caused by the livenessProbe configured on the JobManager. Because the livenessProbe keeps failing, Kubernetes restarts the pod: the probe is checked once a minute and the pod is restarted after three failures, so it restarts every three minutes. As a temporary workaround you can disable the liveness probe, but the liveness probe is very important for HA, and I have not found a good solution yet.

[enter image description here][1]
[1]: https://i.sstatic.net/Xizk2.png
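
To see why the probe fails, it can help to repeat the check by hand from inside the cluster. The sketch below assumes the livenessProbe is a plain TCP connect check and that it targets port 6123; neither detail is stated in this answer, so read both from your own deployment YAML. The class name `TcpProbeCheck` and the method `canConnect` are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper class, named for illustration only.
class TcpProbeCheck {
    // Returns true if a TCP connection to host:port can be opened within the timeout,
    // which is roughly what a TCP-style liveness check tests.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions; pass the JobManager host and the probed port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        boolean reachable = canConnect(host, port, 1000);
        System.out.println(host + ":" + port + (reachable ? " is reachable" : " is NOT reachable"));
    }
}
```

If the probed port turns out not to be listening while HA is enabled, one thing worth checking is whether the JobManager RPC endpoint is bound to a different port in that mode. Flink has a `high-availability.jobmanager.port` option for the RPC port in HA setups; whether it explains this particular case is an assumption to verify against your configuration and the debug log.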
