How can I use Gearman for file processing without killing the database?
I'm currently designing a system for processing uploaded files.
The files are uploaded through a LAMP web frontend and must be processed through several stages, some of which are sequential and others of which may run in parallel.
A few key points:
- The clients uploading the files only care about safely delivering the files, not the results of the processing, so it can be completely asynchronous.
- The files are max 50kb in size
- The system must scale up to processing over a million files a day
- It is critical that no files are lost or go unprocessed
- My assumption is MySQL, but I have no issue with NoSQL if this could offer an advantage.
My initial idea was to have the front end put the files straight into a MySQL DB and then have a number of worker processes poll the database, setting flags as they completed each step. After some rough calculations I realised that this wouldn't scale, as the workers' polling would start to cause locking problems on the upload table.
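For what it's worth, the polling design I had in mind looked roughly like the sketch below (the table, columns and credentials are all invented for illustration). Every worker runs the same claim-a-row loop against the upload table, which is exactly where the lock contention comes from once there are many workers:

```python
# Sketch of the original polling idea -- NOT the approach I ended up with.
import time
import mysql.connector  # assumes MySQL Connector/Python

WORKER_ID = "worker-1"  # hypothetical identifier for this worker process

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="uploads")

while True:
    cur = conn.cursor()
    # Every worker fights over the same rows here, which is where the
    # locking trouble on the upload table starts at higher volumes.
    cur.execute("UPDATE uploads SET status = 'claimed', claimed_by = %s "
                "WHERE status = 'new' LIMIT 1", (WORKER_ID,))
    claimed = cur.rowcount
    conn.commit()
    if not claimed:
        time.sleep(1)  # nothing new; poll again shortly
        continue
    cur.execute("SELECT id, file_data FROM uploads "
                "WHERE claimed_by = %s AND status = 'claimed'", (WORKER_ID,))
    for file_id, file_data in cur.fetchall():
        pass  # ... run the next processing step and update the status flag ...
```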
After some research it looks like Gearman might be the solution to the problem. The workers can register with the Gearman server and can poll for jobs without crippling the DB.
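To make the Gearman side concrete, here is a minimal worker sketch using the python-gearman library (the task name and payload are invented): the worker registers a function with gearmand and blocks waiting for jobs, so it never touches the database just to find work.

```python
import gearman  # python-gearman client library

# Connect to gearmand; 4730 is its default port.
worker = gearman.GearmanWorker(["localhost:4730"])

def stage_one(gearman_worker, gearman_job):
    # gearman_job.data carries whatever the client submitted,
    # e.g. a file id or a filename.
    file_id = gearman_job.data
    # ... do the actual stage-one processing for file_id here ...
    return file_id  # returning a value reports the job as complete

worker.register_task("stage_one", stage_one)
worker.work()  # blocks, handling jobs as gearmand hands them out
```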
What I am currently puzzling over is how to dispatch jobs in the most efficient manner. There are three ways I can see to do this:
- Write a single dispatcher to poll the database and then send jobs to Gearman
- Have the upload process fire off an asynchronous Gearman job when it receives a file
- Use the Gearman MySQL UDF extension to make the DB fire off jobs when files are inserted
The first approach will still hammer the DB somewhat, but it could trivially recover from a failure.
The latter two approaches would seem to require enabling Gearman queue persistence to recover from faults, but I am concerned that if I enable this I will lose the raw speed that attracts me to Gearman and shift the DB bottleneck downstream.
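For reference, the second approach amounts to a fire-and-forget background submission from the upload handler, roughly like the sketch below (python-gearman again; function and task names are invented). Whether that submission survives a gearmand restart then depends entirely on whether gearmand itself is configured with a persistent queue store, which is exactly the trade-off I am worried about.

```python
import json
import gearman  # python-gearman client library

client = gearman.GearmanClient(["localhost:4730"])

def on_file_uploaded(file_id, stored_path):
    """Hypothetical hook called once an uploaded file is safely stored."""
    payload = json.dumps({"file_id": file_id, "path": stored_path})
    # background=True makes this fire-and-forget: the upload request
    # returns immediately and never waits for the processing pipeline.
    client.submit_job("dispatch_file", payload, background=True)
```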
Any advice on which of these approaches would be the most efficient (or, even better, real-world examples) would be much appreciated.
Also feel free to pitch in if you think I'm going about the whole thing the wrong way.
Comments (2)
This has been open for a little while now, so I thought I would provide some information on the approach that I took.
Every time a file is uploaded, I create a Gearman job for a "dispatch" worker which understands the sequence of processing steps required for each file. The dispatcher queues Gearman jobs for each of the processing steps.
Any jobs that complete write a completion timestamp back to the DB and call the dispatcher, which can then queue any follow-on tasks.
Writing a timestamp for each job completion means the system can rebuild its queues if processing is missed or fails, without having to carry the burden of persistent queues.
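To make that concrete, here is a rough sketch of the dispatcher worker under the same assumptions as above (python-gearman plus MySQL Connector/Python; the step names and the upload_steps table are invented). It finds the first step without a completion timestamp and queues it; each step worker writes its timestamp and re-submits the dispatch job when it finishes.

```python
import json
import gearman
import mysql.connector

# Ordered pipeline; in practice some steps could be queued in parallel.
STEPS = ["virus_scan", "extract_metadata", "generate_thumbnail"]

SERVERS = ["localhost:4730"]
client = gearman.GearmanClient(SERVERS)
worker = gearman.GearmanWorker(SERVERS)
db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="uploads")

def dispatch(gearman_worker, gearman_job):
    file_id = json.loads(gearman_job.data)["file_id"]
    cur = db.cursor()
    # upload_steps holds one row per (file, step) with its completion timestamp.
    cur.execute("SELECT step_name FROM upload_steps "
                "WHERE file_id = %s AND completed_at IS NOT NULL", (file_id,))
    done = {row[0] for row in cur.fetchall()}
    for step in STEPS:
        if step not in done:
            # Queue the first unfinished step; its worker writes a timestamp
            # and submits another 'dispatch_file' job when it completes.
            client.submit_job(step, gearman_job.data, background=True)
            break
    return gearman_job.data

worker.register_task("dispatch_file", dispatch)
worker.work()
```

Because every completed step leaves a timestamp in the DB, a periodic sweep over rows with missing timestamps is enough to re-submit anything that was dropped, which is how the system recovers without persistent Gearman queues.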
I would save the files to disk, then send the filename to Gearman. As each part of the process completes, it generates another message for the next part of the process; you could move the file into a new work-in-process directory for the next stage to work on it.
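A rough sketch of that hand-off, again using python-gearman (the directory layout and task names are invented): each stage's worker receives only the filename, reads the file from its own work-in-process directory, moves it on, and then queues the next stage.

```python
import os
import shutil
import gearman

SERVERS = ["localhost:4730"]
# One work-in-process directory per stage (paths are made up).
STAGE_DIRS = {"stage_one": "/data/stage_one", "stage_two": "/data/stage_two"}

client = gearman.GearmanClient(SERVERS)
worker = gearman.GearmanWorker(SERVERS)

def stage_one(gearman_worker, gearman_job):
    filename = gearman_job.data  # only the filename travels through Gearman
    src = os.path.join(STAGE_DIRS["stage_one"], filename)
    # ... run the stage-one processing against src ...
    # Hand the file over to the next stage's directory, then queue that stage.
    shutil.move(src, os.path.join(STAGE_DIRS["stage_two"], filename))
    client.submit_job("stage_two", filename, background=True)
    return filename

worker.register_task("stage_one", stage_one)
worker.work()
```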