Best solution for running multiple intensive jobs at specific times

Published 2024-12-27 08:28:40


We have a web app that uses IMAP to conditionally insert messages into users' mailboxes at user-defined times.

Each of these 'jobs' is stored in a MySQL DB with a timestamp for when the job should be run (which may be months into the future). Jobs can be cancelled at any time by the user.

The problem is that making IMAP connections is a slow process, and before we insert the message we often have to conditionally check whether there is a reply from someone in the inbox (or similar), which adds considerable processing overhead to each job.

We currently have a system where a cron script runs every minute or so and gets all the jobs from the DB that need delivering in the next X minutes. It then splits them into batches of Z jobs, and for each batch performs an asynchronous POST request back to the same server with all the data for those Z jobs (in order to achieve 'fake' multithreading). The server then processes each batch of Z jobs that comes in via HTTP.
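The batching step described above can be sketched as follows. This is a hypothetical illustration, not the site's actual code: the job-row shape, batch size Z, and payload format are all assumptions, and the actual async POST (e.g. via `curl_multi`) is omitted so the batching logic stands alone.

```php
<?php
// Hypothetical sketch of the cron batching step: split the jobs due in the
// next X minutes into batches of Z, and build one POST payload per batch.

function splitIntoBatches(array $jobs, int $batchSize): array
{
    // array_chunk preserves order, so jobs due soonest stay in the first batch
    return array_chunk($jobs, $batchSize);
}

function buildBatchPayloads(array $jobs, int $batchSize): array
{
    $payloads = [];
    foreach (splitIntoBatches($jobs, $batchSize) as $batch) {
        // Each payload would be POSTed asynchronously to a worker endpoint
        $payloads[] = json_encode(['jobs' => $batch]);
    }
    return $payloads;
}

// Example: 7 due jobs with Z = 3 gives payloads of 3, 3 and 1 jobs.
$jobs = array_map(fn ($i) => ['id' => $i, 'run_at' => time() + $i], range(1, 7));
$payloads = buildBatchPayloads($jobs, 3);
```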

The reason we use async HTTP POSTs for multithreading rather than something like pcntl_fork is so that we can add other servers, POST the job data to them instead, and have them run the jobs rather than the current server.

So my question is - is there a better way to do this?

I appreciate work queues like beanstalkd are available to use, but do they fit with the model of having to run jobs at specific times?

Also, because we need to keep the jobs in the DB anyway (because we need to provide the users with a UI for managing the jobs), would adding a work queue in there somewhere actually be adding more overhead rather than reducing it?

I'm sure there are better ways to achieve what we need - any suggestions would be much appreciated!

We're using PHP for all this so a PHP-based/compatible solution is really what we are looking for.


Answers (1)

江挽川 2025-01-03 08:28:40


Beanstalkd would be a reasonable way to do this. It has the concept of put-with-delay, so you can regularly fill the queue from your primary store with messages that become reservable, and run, in X seconds (the time you want the job to run minus the time now).
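The put-with-delay arithmetic is just a clamped subtraction. In the sketch below, `$queue` is a stand-in for a beanstalkd client object (Pheanstalk is a common PHP one); the exact `put()` signature varies by client version, so that call is illustrative rather than a definitive API reference.

```php
<?php
// beanstalkd runs a delayed job "delay" seconds after it is put, so the
// delay is the run-at timestamp minus now, clamped to zero for jobs that
// are already due.

function delayFor(int $runAt, int $now): int
{
    return max(0, $runAt - $now);
}

function enqueueJob(object $queue, array $job, int $now): void
{
    // A Pheanstalk-style client's put() accepts priority, delay and TTR;
    // the exact argument order here is an assumption for illustration.
    $queue->put(json_encode($job), 1024, delayFor($job['run_at'], $now), 60);
}
```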

The workers would then run as normal, connecting to the beanstalkd daemon and waiting for a new job to be reserved. It would also be a lot more efficient without the overhead of an HTTP connection. As an example, I used to post messages to Amazon SQS (over HTTP). That could barely do 20 QPS at the very most, but Beanstalkd accepted over a thousand per second with barely any effort.

Edited to add: You can't delete a job without knowing its ID, though you could store that outside the queue. OTOH, do users have to be able to delete jobs at any time up to the last minute? You don't have to put a job into the queue weeks or months in advance, so you would still have only one DB-reader that runs every, say, 1 to 5 minutes to put the next few jobs into the queue, and still have as many workers as you need, with the efficiencies they can bring.
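The periodic DB-reader described above can be sketched as a pure selection step: each run picks only the jobs due within the next window, so a user cancellation only has to delete (or flag) the DB row before it is ever enqueued. The row shape, `cancelled` flag, and window length below are assumptions for illustration; the actual SQL query and enqueue call are left out.

```php
<?php
// Sketch of the periodic reader: from all job rows, keep only those due
// within the next $windowSeconds that have not been cancelled, paired with
// the put-with-delay value each should be enqueued with.

function jobsDueWithin(array $rows, int $now, int $windowSeconds): array
{
    $due = [];
    foreach ($rows as $row) {
        if (!$row['cancelled'] && $row['run_at'] <= $now + $windowSeconds) {
            $due[] = [
                'id'    => $row['id'],
                'delay' => max(0, $row['run_at'] - $now),
            ];
        }
    }
    return $due;
}

$now = 1000;
$rows = [
    ['id' => 1, 'run_at' => 1060, 'cancelled' => false], // due in 60 s: keep
    ['id' => 2, 'run_at' => 9000, 'cancelled' => false], // far future: skip
    ['id' => 3, 'run_at' => 1010, 'cancelled' => true],  // cancelled: skip
];
$due = jobsDueWithin($rows, $now, 300);
```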

Ultimately, it depends on the number of DB read/writes that you are doing, and how the database server is able to handle them.

If what you are doing is not a problem now, and won't become so with additional load, then carry on.
