Distributed Celery scheduler

Posted on 2024-11-28 15:10:58


I'm looking for a distributed cron-like framework for Python, and found Celery. However, the docs say "You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks", and Celery uses celery.beat.PersistentScheduler, which stores the schedule in a local file.

So, my question: is there an implementation other than the default that can put the schedule "into the cluster" and coordinate task execution so that each task runs only once?
My goal is to be able to run celerybeat with identical schedules on all hosts in the cluster.
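
For reference, the default setup described above looks roughly like this. This is a minimal sketch, assuming a recent Celery and a Redis broker; the module, task, and schedule names are illustrative:

    # tasks.py - illustrative names; any broker works
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("proj", broker="redis://localhost:6379/0")

    @app.task
    def cleanup():
        pass  # the periodic work goes here

    app.conf.beat_schedule = {
        "cleanup-every-morning": {
            "task": "tasks.cleanup",
            "schedule": crontab(hour=3, minute=0),
        },
    }

Started with celery -A tasks beat, the default PersistentScheduler writes its state to a local celerybeat-schedule file (the path is configurable with --schedule), which is why a second beat process on another host knows nothing about the first and happily enqueues the same tasks again.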

Thanks


Comments (3)

挽心 2024-12-05 15:10:58


tl;dr: No, Celerybeat is not suitable for your use case. You have to run exactly one celerybeat process, otherwise your tasks will be duplicated.

I know this is a very old question. I will try to give a small summary, because I had the same problem/question in 2018.

Some background: we're running a Django application (with Celery) in a Kubernetes cluster. The cluster (EC2 instances) and the Pods (~containers) are autoscaled: simply put, I never know when, or how many, instances of the application are running.

It's your responsibility to run only one celerybeat process; otherwise, your tasks will be duplicated. [1] There was a feature request for this in the Celery repository: [2]

Requiring the user to ensure that only one instance of celerybeat
exists across their cluster creates a substantial implementation
burden (either creating a single point-of-failure or encouraging users
to roll their own distributed mutex).

celerybeat should either provide a mechanism to prevent inadvertent
concurrency, or the documentation should suggest a best-practice
approach.

After some time, this feature request was rejected by the author of Celery for lack of resources. [3] I highly recommend reading the entire thread on GitHub. People there recommend a number of projects/solutions; see the thread for the list.

I did not try any of them (I do not want another dependency in my app, and I do not like locking tasks: you need to deal with fail-over, etc.).
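
For context, the locking pattern referred to above looks roughly like this. This is a minimal sketch, not an official Celery feature, assuming redis-py and a reachable Redis server; the task name and timeout are made up:

    import redis
    from celery import Celery

    app = Celery("proj", broker="redis://localhost:6379/0")
    redis_client = redis.Redis()

    @app.task
    def nightly_report():
        # timeout must outlive the longest run: if the worker dies
        # mid-task, the lock simply expires after `timeout` seconds -
        # that expiry is exactly the fail-over handling complained
        # about above.
        lock = redis_client.lock("lock:nightly_report",
                                 timeout=300, blocking_timeout=0)
        if not lock.acquire():  # someone else holds it: skip, don't wait
            return "skipped: already running elsewhere"
        try:
            pass  # the actual work goes here
        finally:
            lock.release()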

I ended up using a CronJob in Kubernetes (https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).

[1] celerybeat - multiple instances & monitoring

[2] https://github.com/celery/celery/issues/251

[3] https://github.com/celery/celery/issues/251#issuecomment-228214951

习惯成性 2024-12-05 15:10:58


I think there might be some misunderstanding about what celerybeat does. Celerybeat does not process periodic tasks; it only publishes them: it puts each periodic task on the queue to be processed by the celeryd workers. If you run a single celerybeat process and multiple celeryd processes, task execution is distributed across the cluster.
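
Concretely, the split looks like this; a minimal sketch with made-up names, assuming a Redis broker, showing one beat publisher feeding many workers:

    # tasks.py - one scheduled task firing every 30 seconds
    from celery import Celery

    app = Celery("proj", broker="redis://localhost:6379/0")

    @app.task
    def heartbeat():
        print("tick")

    app.conf.beat_schedule = {
        "heartbeat-every-30s": {"task": "tasks.heartbeat", "schedule": 30.0},
    }

    # Run exactly ONE publisher:   celery -A tasks beat
    # Run many consumers:          celery -A tasks worker
    # Beat enqueues each run once; the broker delivers it to exactly
    # one worker, so execution is already spread over the cluster.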

月亮是我掰弯的 2024-12-05 15:10:58


We had this same issue, with three servers running Celerybeat. Our solution, however, was to run Celerybeat on only a single server, so duplicate tasks weren't created. Why would you want Celerybeat running on multiple servers?

If you're worried about Celerybeat going down, just create a script to monitor that the Celerybeat process is still running:

$ ps aux | grep celerybeat

That will show you whether the Celerybeat process is running. Then create a script that emails your system admins if it finds the process is down. Here's a sample setup where we're only running Celerybeat on one server.
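
A minimal sketch of such a watchdog, assuming pgrep is available and a mail server listens on localhost; the addresses are placeholders:

    import smtplib
    import subprocess
    from email.message import EmailMessage

    def beat_is_running() -> bool:
        # Same check as `ps aux | grep celerybeat`, but pgrep never
        # matches itself; the pattern also covers the newer
        # `celery ... beat` command line.
        result = subprocess.run(["pgrep", "-f", "celery.*beat"],
                                capture_output=True)
        return result.returncode == 0

    def alert_admins() -> None:
        msg = EmailMessage()
        msg["Subject"] = "celerybeat is down"
        msg["From"] = "watchdog@example.com"    # placeholder
        msg["To"] = "sysadmin@example.com"      # placeholder
        msg.set_content("No celerybeat process found - restart it.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":  # run from cron, e.g. every minute
        if not beat_is_running():
            alert_admins()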
