Working around celerybeat as a single point of failure

Published 2025-01-05 20:22:47


I'm looking for a recommended solution to work around celerybeat being a single point of failure in a celery/rabbitmq deployment. Searching the web, I haven't found anything that makes sense so far.

In my case, a timed scheduler kicks off, once a day, a series of jobs that can run for half a day or longer. Since there can only be one celerybeat instance, if something happens to it or to the server it runs on, critical jobs will not run.

I'm hoping there is already a working solution for this, as I can't be the only one who needs a reliable (clustered or the like) scheduler. I don't want to resort to some sort of database-backed scheduler if I don't have to.


Comments (1)

少女净妖师 2025-01-12 20:22:47


There is an open issue about this in the celery GitHub repo. I don't know whether they are working on it, though.

As a workaround, you could add a lock for tasks so that only one instance of a specific PeriodicTask runs at a time.

Something like:

# cache.add() is atomic: it returns False if the key already exists,
# so only the first instance to reach this line keeps running.
if not cache.add('My-unique-lock-name', True, timeout=lock_timeout):
    return
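As a self-contained illustration of this pattern (the cache here is an in-memory stand-in for a shared backend such as memcached, and all names are assumptions, not part of the original answer), only the first call past the `add()` gate actually does the work:

```python
import time

# Minimal in-memory stand-in for the cache backend's atomic add()
# (e.g. Django's cache.add). In production this must be a backend
# shared by all servers, such as memcached or Redis.
class FakeCache:
    def __init__(self):
        self._store = {}

    def add(self, key, value, timeout):
        """Set key only if absent or expired; return True on success."""
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return False  # lock already held by another instance
        self._store[key] = (value, now + timeout)
        return True

cache = FakeCache()
run_every = 24 * 60 * 60          # once a day, in seconds
lock_timeout = 0.9 * run_every    # leave margin for schedule drift

def nightly_task():
    # Every celerybeat instance queues this task; only the first
    # worker to grab the lock actually runs the job body.
    if not cache.add('nightly-task-lock', True, timeout=lock_timeout):
        return 'skipped'
    return 'ran'
```

The second concurrent invocation sees the key already set and returns immediately, which is the whole trick.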

Figuring out the lock timeout is, well, tricky. We're using 0.9 * the task's run_every seconds, since different celerybeats may try to run the tasks at different times.
The 0.9 just leaves some margin (e.g. when celery falls a little behind schedule once and then catches back up, which would otherwise leave the previous lock still active).

Then you can run a celerybeat instance on every machine. Each celerybeat instance will queue every task, but only one of them will actually finish the run.

Tasks will still respect run_every this way; worst case, tasks will run at 0.9 * run_every intervals.

One issue with this approach: if tasks were queued but not processed at the scheduled time (for example, because the workers were unavailable), the lock may be placed at the wrong time, possibly causing the next task not to run at all. To work around this, you would need some kind of detection mechanism for whether the task is more or less on time.

Still, this shouldn't be a common situation in production.
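One possible shape for that detection mechanism (a sketch under my own assumptions; the function and parameter names are made up, not from the answer) is to compare the task's scheduled time against the actual start time and skip taking the lock when the task fires too late:

```python
import time

# Seconds of acceptable queue delay; tune per deployment. A delayed
# duplicate that misses this window should not grab the lock and
# starve the next legitimate run.
GRACE = 60

def is_roughly_on_time(scheduled_at, now=None, grace=GRACE):
    """Return True if the task started within `grace` seconds of its ETA."""
    now = time.time() if now is None else now
    return (now - scheduled_at) <= grace
```

A task would call this with its intended fire time before attempting `cache.add()`, and bail out without locking if it is badly late.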

Another solution is to subclass the celerybeat Scheduler and override its tick method. Then, for every tick, acquire a lock before processing tasks. This makes sure that celerybeats with the same periodic tasks won't queue the same tasks multiple times: for each tick, only one celerybeat (the one that wins the race) queues tasks. If one celerybeat goes down, another one will win the race on the next tick.
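The tick-level race can be simulated end to end without celery (everything below is a toy model I'm assuming for illustration, not the real celery.beat API): several beat instances tick at the same moment, the lock's `add()` is atomic, so exactly one of them queues tasks.

```python
import time

# Stand-in for a shared-cache lock with memcached-style add() semantics.
class TickLock:
    def __init__(self):
        self._expires = 0.0
        self._owner = None

    def add(self, owner, timeout):
        now = time.time()
        if self._expires > now:
            return False          # another beat already won this tick
        self._owner, self._expires = owner, now + timeout
        return True

# Toy beat instance: the real fix would override Scheduler.tick()
# in a celery.beat.Scheduler subclass with the same gate.
class Beat:
    def __init__(self, name, lock, interval=5.0):
        self.name, self.lock, self.interval = name, lock, interval

    def tick(self):
        # Take the shared lock before queuing; losers skip this tick.
        if self.lock.add(self.name, timeout=0.9 * self.interval):
            return f'{self.name}: queued tasks'
        return f'{self.name}: skipped'

lock = TickLock()
beats = [Beat('beat-a', lock), Beat('beat-b', lock), Beat('beat-c', lock)]
results = [b.tick() for b in beats]
```

If `beat-a` dies, its lock expires after `0.9 * interval` seconds and one of the survivors wins the next tick, which is the failover behavior described above.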

This of course can be used in combination with the first solution.

Of course, for this to work the cache backend needs to be replicated and/or shared across all servers.
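For instance, if you are using Django's cache framework (an assumption; the answer doesn't name a framework, and the hostname below is made up), every server would point its default cache at the same backend so that `cache.add()` arbitrates between them:

```python
# Django settings fragment: all servers share one memcached instance,
# so cache.add() calls from any of them compete for the same keys.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': 'cache.internal:11211',  # hypothetical shared host
    }
}
```

A per-server local-memory cache would defeat the scheme entirely, since each beat would only ever see its own locks.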

It's an old question, but I hope this helps someone.
