Advice on a daemon that accepts zip files for processing
I'm looking to write a daemon that:
- reads a message from a queue (SQS, RabbitMQ, whatever...) containing a path to a zip file
- updates a record in the database saying something like "this job is processing"
- reads the aforementioned archive's contents and inserts a row into a database with information culled from the file metadata for each file found
- copies each file to S3
- deletes the zip file
- marks the job as "complete"
- reads the next message in the queue, and repeats
This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.
I'm fluent with Python, so the very first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process mentioned above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it'd be nice if it could reasonably handle several at a time. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
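To make the loop concrete, here is roughly what I have in mind as a single-process version. boto3 for SQS/S3 and sqlite3 for the job table are only placeholders for illustration, as are the queue URL, bucket, and table names; this isn't a commitment to any particular stack.

```python
# Rough single-process sketch; boto3/sqlite3 and all names here are placeholders.
import json
import os
import sqlite3
import zipfile

import boto3  # assumed: pip install boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
db = sqlite3.connect("jobs.db")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads"  # placeholder
BUCKET = "upload-bucket"  # placeholder


def process(job_id, zip_path):
    db.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
    db.commit()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # one metadata row per file, then copy the member to S3
            db.execute(
                "INSERT INTO files (job_id, name, size) VALUES (?, ?, ?)",
                (job_id, info.filename, info.file_size),
            )
            s3.upload_fileobj(archive.open(info), BUCKET, info.filename)
    os.remove(zip_path)
    db.execute("UPDATE jobs SET status = 'complete' WHERE id = ?", (job_id,))
    db.commit()


while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])  # e.g. {"job_id": 1, "zip_path": "/tmp/a.zip"}
        process(body["job_id"], body["zip_path"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```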
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
3 Answers
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing; over 2 million jobs so far in the last few weeks). Throw a message onto the queue with the zip filename (maybe from a specific directory); I serialise the command and parameters as JSON. When you reserve the message in your worker client, no one else can get it unless you let it time out, at which point it goes back onto the queue to be picked up again.
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
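A minimal sketch of that put/reserve/delete cycle, assuming the beanstalkc client library; the tube name, the message fields, and handle_zip are just examples, not part of the answer above.

```python
import json

import beanstalkc  # assumed client library: pip install beanstalkc

beanstalk = beanstalkc.Connection(host="localhost", port=11300)
beanstalk.use("uploads")    # producer side: the web frontend puts jobs into this tube
beanstalk.watch("uploads")  # worker side: reserve jobs from the same tube

# producer: serialise the command and parameters as JSON
beanstalk.put(json.dumps({"cmd": "process_zip", "path": "/uploads/batch-42.zip"}))


def handle_zip(path):
    """Placeholder for the real work: unzip, insert metadata rows, copy files to S3."""


# worker loop: while a job is reserved, no other worker can take it
while True:
    job = beanstalk.reserve()       # blocks until a job is available
    params = json.loads(job.body)
    try:
        handle_zip(params["path"])
        job.delete()                # done, remove it from the queue for good
    except Exception:
        job.release(delay=60)       # put it back on the queue to be retried later
```

Running several copies of the worker loop gives you the "as many worker processes as you want" part.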
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application, I think Twisted or any framework for creating server applications is going to be overkill.
Keep it simple: a Python script starts up, checks the queue, does some work, then checks the queue again. If you want a proper background daemon you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, maybe a try/except block to email out failures to you.
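A bare-bones version of that approach might look like the following; get_next_job and process are stand-ins for your own queue read and zip handling, and the addresses are obviously placeholders.

```python
import logging
import smtplib
import time
from email.message import EmailMessage

logging.basicConfig(filename="worker.log", level=logging.INFO)


def get_next_job():
    """Stand-in: read one message from SQS/RabbitMQ/Beanstalkd, or None if empty."""
    return None


def process(job):
    """Stand-in: unzip, insert metadata rows, copy files to S3, mark job complete."""


def email_failure(job, exc):
    msg = EmailMessage()
    msg["Subject"] = "zip worker failed on %r" % (job,)
    msg["From"] = "worker@example.com"   # placeholder addresses
    msg["To"] = "you@example.com"
    msg.set_content(str(exc))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


while True:
    job = get_next_job()
    if job is None:
        time.sleep(5)                    # queue empty; poll again shortly
        continue
    try:
        process(job)
        logging.info("finished %r", job)
    except Exception as exc:             # deliberately broad: log it and get emailed
        logging.exception("failed %r", job)
        email_failure(job, exc)
```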
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this: a Task is dispatched to process the upload, and all work is done inside the Task.
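For reference, the shape of that setup might look roughly like this, using the class-based Task style from the old Celery docs linked above; ProcessZipUpload, save_uploaded_file, and create_job_record are made-up names for illustration, not the answerer's actual code.

```python
# tasks.py
from celery.task import Task  # class-based style from the Celery 1.x/2.x docs above


class ProcessZipUpload(Task):
    def run(self, job_id, zip_path):
        # all of the real work happens here: unzip, record metadata,
        # copy files to S3, delete the zip, mark the job complete
        ...


# views.py -- the simple Django view that accepts the upload and dispatches the Task
from django.http import HttpResponse


def handle_upload(request):
    zip_path = save_uploaded_file(request.FILES["archive"])  # placeholder helper
    job_id = create_job_record(zip_path)                     # placeholder helper
    ProcessZipUpload.delay(job_id, zip_path)                 # queued via RabbitMQ
    return HttpResponse("Upload received; processing in the background.")
```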