Airflow scheduler fails to start after upgrading Google Composer

Good morning,

After upgrading Google Composer to version 1.18 and Apache Airflow to version 1.10.15 (using Composer's auto-upgrade), the scheduler does not seem to be able to start.

Airflow message: "The scheduler does not appear to be running. Last heartbeat was received 1 day ago. The DAGs list may not update, and new tasks will not be scheduled."

After getting this I tried:

  • Restarted the web server:
    gcloud beta composer environments restart-web-server

  • Tried to restart the Airflow scheduler:
    kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -

  • I looked at the pod's info:
    kubectl describe pod airflow-scheduler

Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 23 Feb 2022 15:59:13 +0000
Finished: Wed, 23 Feb 2022 16:04:09 +0000

  • So I deleted the pod and waited for it to come back up by itself:
    kubectl delete pod airflow-scheduler-...

  • EDIT 1: The logs from the pod:

Dags and plugins are not synced yet

  • EDIT 2: Additional logs:

Building synchronization state...
Starting synchronization...
Copying gs://europe-west1-********-bucket/dags/sql/...
Skipping attempt to download to filename ending with slash
(/home/airflow/gcs/dags/sql/). This typically happens when using
gsutil to download from a subdirectory created by the Cloud Console
(https://cloud.google.com/console)
/ [0/1 files][ 0.0 B/ 11.0 B] 0% Done InvalidUrl Error: Invalid destination path: /home/airflow/gcs/dags/sql/

But it keeps restarting on its own, and sometimes CrashLoopBackOff appears, which indicates that the container is repeatedly crashing after restarting.
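For reference, a minimal sketch of the same troubleshooting loop with the flags filled in (the environment name my-composer-env and location europe-west1 are placeholders, and the pod name suffix will differ per environment):

    # Restart the web server (environment name and location are placeholders)
    gcloud beta composer environments restart-web-server my-composer-env \
        --location europe-west1

    # Force-recreate the scheduler deployment
    kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -

    # Grab the logs of the previous (crashed) container run
    # (add -c <container-name> if the pod has more than one container)
    kubectl logs airflow-scheduler-... --previous

    # Watch the restarts; CrashLoopBackOff shows up in the STATUS column
    kubectl get pods -w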

Not sure what more I can do :/

Thanks for the help :)


Comments (2)

世态炎凉 2025-01-16 08:57:08


The problem you are facing has to do with resources hitting their limits, which is preventing the scheduler from starting.

My assumption is that one of these could be happening:

  1. The limits set on the scheduler are causing the gcsfuse process to
    get killed; can you remove them to check whether that stops the
    crash loop?
  2. The K8s cluster does not have enough resources for the Composer Agent
    to start the scheduler job; you can add resources to the cluster.
  3. You are getting a corrupted entry when it starts. What you can do in
    that case is restart the scheduler on your own, by using SSH to
    connect into the instance.
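As a rough illustration of points 1 and 2 (a sketch, not part of the original answer), one might drop the scheduler's resource limits and check node headroom like this; it assumes the scheduler is the first container in the pod spec and the deployment lives in the default namespace:

    # Remove the resource limits from the scheduler container
    # (assumes container index 0 is the scheduler; verify first with
    # kubectl get deployment airflow-scheduler -o yaml)
    kubectl patch deployment airflow-scheduler --type json \
        -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'

    # Check whether the nodes have spare capacity for the scheduler job
    kubectl describe nodes | grep -A 5 "Allocated resources"
    kubectl top nodes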
歌入人心 2025-01-16 08:57:08


In our DAGs folder (inside the bucket) we have another folder with all the SQL files that are triggered by the different BigQuery operators. For some reason the synchronization of that folder was not being done correctly, so after deleting the folder and adding it again the workers were up again.
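A likely culprit, given the InvalidUrl error in the question, is a zero-byte placeholder object whose name literally ends in a slash (created when the sql/ subfolder was made in the Cloud Console), which the sync cannot map to a local file. A sketch of how one might confirm and remove it with gsutil, keeping the masked bucket name as a placeholder (if gsutil refuses the trailing-slash name, the object can also be deleted from the Cloud Console):

    # List the contents of dags/sql/; a placeholder shows up as a
    # 0-byte entry with the same name as the "directory" itself
    gsutil ls -la 'gs://europe-west1-********-bucket/dags/sql/'

    # Delete the object literally named "dags/sql/" (0 bytes)
    gsutil rm 'gs://europe-west1-********-bucket/dags/sql/'

    # Re-upload the SQL files so the prefix is recreated without a placeholder
    gsutil -m cp -r ./sql 'gs://europe-west1-********-bucket/dags/'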
