How can I use Gearman for file processing without killing the database?
I'm currently designing a system for processing uploaded files.
The files are uploaded through a LAMP web frontend and must be processed through several stages, some of which are sequential and others of which may run in parallel.
A few key points:
- The clients uploading the files only care about safely delivering the files, not the results of the processing, so it can be completely asynchronous.
- The files are max 50kb in size
- The system must scale up to processing over a million files a day
- It is critical that no files are lost or go unprocessed
- My assumption is MySQL, but I have no issue with NoSQL if this could offer an advantage.
My initial idea was to have the front end put the files straight into a MySQL DB and then have a number of worker processes poll the database, setting flags as they completed each step. After some rough calculations I realised that this wouldn't scale, as the workers' polling would start to cause locking problems on the upload table.
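For what it's worth, the polling design I had in mind looked roughly like the sketch below (the table, columns and credentials are all invented for illustration). Every worker runs the same claim-a-row loop against the upload table, which is exactly where the lock contention comes from once there are many workers:

```python
# Sketch of the original polling idea -- NOT the approach I ended up with.
import time
import mysql.connector  # assumes MySQL Connector/Python

WORKER_ID = "worker-1"  # hypothetical identifier for this worker process

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="uploads")

while True:
    cur = conn.cursor()
    # Every worker fights over the same rows here, which is where the
    # locking trouble on the upload table starts at higher volumes.
    cur.execute("UPDATE uploads SET status = 'claimed', claimed_by = %s "
                "WHERE status = 'new' LIMIT 1", (WORKER_ID,))
    claimed = cur.rowcount
    conn.commit()
    if not claimed:
        time.sleep(1)  # nothing new; poll again shortly
        continue
    cur.execute("SELECT id, file_data FROM uploads "
                "WHERE claimed_by = %s AND status = 'claimed'", (WORKER_ID,))
    for file_id, file_data in cur.fetchall():
        pass  # ... run the next processing step and update the status flag ...
```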
After some research it looks like Gearman might be the solution to the problem. The workers can register with the Gearman server and can poll for jobs without crippling the DB.
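To make the Gearman side concrete, here is a minimal worker sketch using the python-gearman library (the task name and payload are invented): the worker registers a function with gearmand and blocks waiting for jobs, so it never touches the database just to find work.

```python
import gearman  # python-gearman client library

# Connect to gearmand; 4730 is its default port.
worker = gearman.GearmanWorker(["localhost:4730"])

def stage_one(gearman_worker, gearman_job):
    # gearman_job.data carries whatever the client submitted,
    # e.g. a file id or a filename.
    file_id = gearman_job.data
    # ... do the actual stage-one processing for file_id here ...
    return file_id  # returning a value reports the job as complete

worker.register_task("stage_one", stage_one)
worker.work()  # blocks, handling jobs as gearmand hands them out
```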
What I am currently puzzling over is how to dispatch jobs in the most efficient manner. There are three ways I can see to do this:
- Write a single dispatcher to poll the database and then send jobs to Gearman
- Have the upload process fire off an asynchronous Gearman job when it receives a file
- Use the Gearman MySQL UDF extension to make the DB fire off jobs when files are inserted
The first approach will still hammer the DB somewhat, but it could trivially recover from a failure.
The latter two approaches would seem to require enabling Gearman queue persistence to recover from faults, but I am concerned that if I enable this I will lose the raw speed that attracts me to Gearman and shift the DB bottleneck downstream.
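For reference, the second approach amounts to a fire-and-forget background submission from the upload handler, roughly like the sketch below (python-gearman again; function and task names are invented). Whether that submission survives a gearmand restart then depends entirely on whether gearmand itself is configured with a persistent queue store, which is exactly the trade-off I am worried about.

```python
import json
import gearman  # python-gearman client library

client = gearman.GearmanClient(["localhost:4730"])

def on_file_uploaded(file_id, stored_path):
    """Hypothetical hook called once an uploaded file is safely stored."""
    payload = json.dumps({"file_id": file_id, "path": stored_path})
    # background=True makes this fire-and-forget: the upload request
    # returns immediately and never waits for the processing pipeline.
    client.submit_job("dispatch_file", payload, background=True)
```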
Any advice on which of these approaches would be the most efficient (or, even better, real-world examples) would be much appreciated.
Also feel free to pitch in if you think I'm going about the whole thing the wrong way.
Comments (2)
This has been open for a little while now, so I thought I would provide some information on the approach that I took.
Every time a file is uploaded, I create a Gearman job for a "dispatch" worker which understands the sequence of processing steps required for each file. The dispatcher queues Gearman jobs for each of the processing steps.
Any jobs that complete write a completion timestamp back to the DB and call the dispatcher, which can then queue any follow-on tasks.
Writing a timestamp for each job completion means the system can rebuild its queues if processing is missed or fails, without having to carry the burden of persistent queues.
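To make that concrete, here is a rough sketch of the dispatcher worker under the same assumptions as above (python-gearman plus MySQL Connector/Python; the step names and the upload_steps table are invented). It finds the first step without a completion timestamp and queues it; each step worker writes its timestamp and re-submits the dispatch job when it finishes.

```python
import json
import gearman
import mysql.connector

# Ordered pipeline; in practice some steps could be queued in parallel.
STEPS = ["virus_scan", "extract_metadata", "generate_thumbnail"]

SERVERS = ["localhost:4730"]
client = gearman.GearmanClient(SERVERS)
worker = gearman.GearmanWorker(SERVERS)
db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="uploads")

def dispatch(gearman_worker, gearman_job):
    file_id = json.loads(gearman_job.data)["file_id"]
    cur = db.cursor()
    # upload_steps holds one row per (file, step) with its completion timestamp.
    cur.execute("SELECT step_name FROM upload_steps "
                "WHERE file_id = %s AND completed_at IS NOT NULL", (file_id,))
    done = {row[0] for row in cur.fetchall()}
    for step in STEPS:
        if step not in done:
            # Queue the first unfinished step; its worker writes a timestamp
            # and submits another 'dispatch_file' job when it completes.
            client.submit_job(step, gearman_job.data, background=True)
            break
    return gearman_job.data

worker.register_task("dispatch_file", dispatch)
worker.work()
```

Because every completed step leaves a timestamp in the DB, a periodic sweep over rows with missing timestamps is enough to re-submit anything that was dropped, which is how the system recovers without persistent Gearman queues.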
I would save the files to disk, then send the filename to Gearman. As each part of the process completes, it generates another message for the next part of the process; you could move the file into a new work-in-process directory for the next stage to work on it.
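A rough sketch of that hand-off, again using python-gearman (the directory layout and task names are invented): each stage's worker receives only the filename, reads the file from its own work-in-process directory, moves it on, and then queues the next stage.

```python
import os
import shutil
import gearman

SERVERS = ["localhost:4730"]
# One work-in-process directory per stage (paths are made up).
STAGE_DIRS = {"stage_one": "/data/stage_one", "stage_two": "/data/stage_two"}

client = gearman.GearmanClient(SERVERS)
worker = gearman.GearmanWorker(SERVERS)

def stage_one(gearman_worker, gearman_job):
    filename = gearman_job.data  # only the filename travels through Gearman
    src = os.path.join(STAGE_DIRS["stage_one"], filename)
    # ... run the stage-one processing against src ...
    # Hand the file over to the next stage's directory, then queue that stage.
    shutil.move(src, os.path.join(STAGE_DIRS["stage_two"], filename))
    client.submit_job("stage_two", filename, background=True)
    return filename

worker.register_task("stage_one", stage_one)
worker.work()
```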