Simulating Google AppEngine's Task Queue with Gearman
One of the characteristics I love most about Google's Task Queue is its simplicity. More specifically, I love that it takes a URL and some parameters and then posts to that URL when the task queue is ready to execute the task.
This structure means that the tasks are always executing the most current version of the code. Conversely, my gearman workers all run code within my django project -- so when I push a new version live, I have to kill off the old worker and run a new one so that it uses the current version of the code.
My goal is to have the task queue be independent from the code base so that I can push a new live version without restarting any workers. So, I got to thinking: why not make tasks executable by url just like the google app engine task queue?
The process would work like this:
- User request comes in and triggers a few tasks that shouldn't be blocking.
- Each task has a unique URL, so I enqueue a gearman task to POST to the specified URL.
- The gearman server finds a worker and passes the URL and POST data to it.
- The worker simply POSTs the data to the URL, thus executing the task (see the sketch after this list).
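To make the flow concrete, here is a rough sketch of both halves. It assumes the python-gearman client library, a gearman server on localhost:4730, and a made-up task name ("post_url") and JSON payload format:

```python
import json
import urllib2

import gearman

GEARMAN_SERVERS = ['localhost:4730']


# --- In the web process (e.g. inside a Django view): enqueue and return immediately.
def enqueue_task(url, post_data):
    client = gearman.GearmanClient(GEARMAN_SERVERS)
    client.submit_job('post_url',
                      json.dumps({'url': url, 'post_data': post_data}),
                      background=True)  # fire-and-forget, don't block the request


# --- In the worker process: knows nothing about the Django code base.
def post_url(worker, job):
    payload = json.loads(job.data)
    request = urllib2.Request(payload['url'],
                              data=payload['post_data'].encode('utf-8'))
    response = urllib2.urlopen(request, timeout=10)  # tasks finish in < 10s
    return response.read()


if __name__ == '__main__':
    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task('post_url', post_url)
    worker.work()  # block and serve jobs forever
```

The worker process is deliberately ignorant of the application code: it only knows how to POST, so it can keep running across deploys.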
Assume the following:
- Each request from a gearman worker is signed somehow so that we know it's coming from a gearman server and not a malicious request (one possible signing scheme is sketched after this list).
- Tasks are limited to run in less than 10 seconds (there would be no long tasks that could time out).
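One way the signing assumption could be satisfied, sketched with Python's standard hmac module; the shared secret and the idea of an X-Task-Signature header are purely illustrative:

```python
import hashlib
import hmac

# Shared between the gearman worker and the web app; illustrative value only.
TASK_QUEUE_SECRET = 'a-long-random-string'


def sign(post_data):
    """Compute an HMAC of the raw POST body using the shared secret."""
    return hmac.new(TASK_QUEUE_SECRET, post_data, hashlib.sha256).hexdigest()


def is_valid_signature(post_data, signature):
    """The task view recomputes the HMAC and rejects the request on mismatch."""
    return sign(post_data) == signature
```

The worker would send sign(post_data) along with each request (for example in an X-Task-Signature header), and each task view would check it before doing any work.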
What are the potential pitfalls of such an approach? Here's one that worries me:
- The server can potentially get hammered with many requests all at once that are triggered by a previous request. So one user request might entail 10 concurrent http requests. I suppose I could have a single worker with a sleep before every request to rate-limit.
Any thoughts?
As a user of both Django and Google AppEngine, I can certainly appreciate what you're getting at. At work I'm currently working on the exact same scenario using some pretty cool open source tools.
Take a look at Celery. It's a distributed task queue built with Python that exposes three concepts - a queue, a set of workers, and a result store. It's pluggable with different tools for each part.
The queue should be battle-hardened, and fast. Check out RabbitMQ for a great queue implementation in Erlang, using the AMQP protocol.
The workers ultimately can be Python functions. You can trigger workers using either queue messages or, perhaps more pertinent to what you're describing, webhooks.
Check out the Celery webhook documentation. Using all these tools you can build a production ready distributed task queue that implements your requirements above.
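For example, a worker can be an ordinary decorated Python function that a web request enqueues without blocking. This is a minimal sketch, assuming a local RabbitMQ broker with default credentials and a made-up task name:

```python
from celery import Celery

# Broker URL assumes a local RabbitMQ instance with default credentials.
app = Celery('tasks', broker='amqp://guest@localhost//')


@app.task
def send_welcome_email(user_id):
    # Ordinary Python code; executed by a Celery worker process.
    print('sending welcome email to user %s' % user_id)


# From a Django view, enqueue without blocking the request:
# send_welcome_email.delay(42)
```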
I should also mention that, with regard to your first pitfall, Celery implements rate-limiting of tasks using a Token Bucket algorithm.
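That rate limit can be declared on the task itself; a small sketch, assuming the same kind of Celery app as above and an illustrative limit of ten executions per second:

```python
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')


# rate_limit is enforced per worker via a token bucket; '10/s' is illustrative.
@app.task(rate_limit='10/s')
def post_to_url(url, post_data):
    pass  # the actual HTTP POST would go here
```

With a limit like this in place, ten tasks fanned out by a single user request would be drained gradually instead of hammering the server all at once.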