How to handle blocking IO in mod_wsgi/Django?
I am running Django under Apache+mod_wsgi in daemon mode with the following config:
WSGIDaemonProcess myserver processes=2 threads=15
My application does some IO on the backend, which could take several seconds.
def my_django_view(request):
    content = ...  # Do some processing on backend file
    return HttpResponse(content)
It appears that if I am processing more than 2 HTTP requests that are doing this kind of IO, Django simply blocks until one of the earlier requests completes.
Is this expected behavior? Shouldn't threading help alleviate this, i.e., shouldn't I be able to handle up to 15 separate requests per WSGI process before I see this kind of wait?
Or am I missing something here?
If the processing is in Python, then the Global Interpreter Lock is not being released -- in a single Python process, only one thread can be executing Python code at a time. The GIL is usually released inside C code, though -- like most I/O, for example.
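You can see the difference with a minimal sketch (my own illustration, not from the original post): CPU-bound threads take turns on the GIL, while I/O-bound threads all wait at once.

    import threading
    import time

    def cpu_bound():
        # Pure-Python work holds the GIL, so threads serialize.
        total = 0
        for i in range(10_000_000):
            total += i

    def io_bound():
        # time.sleep releases the GIL, like most blocking I/O,
        # so all the threads can wait concurrently.
        time.sleep(1)

    def timed(target, n=4):
        threads = [threading.Thread(target=target) for _ in range(n)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    print("4 CPU-bound threads:", timed(cpu_bound))  # roughly 4x a single run
    print("4 I/O-bound threads:", timed(io_bound))   # roughly 1 second total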
If this kind of processing is going to happen a lot, you might consider running a second "worker" application as a daemon, reading tasks from the database, performing the operations and writing results back to the database. Apache might decide to kill processes that take too long to respond.
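A minimal sketch of that worker pattern, assuming a hypothetical tasks table with id, payload, status, and result columns (the schema and the process() function are placeholders):

    import sqlite3
    import time

    def process(payload):
        # Stand-in for the slow backend work.
        return payload.upper()

    def worker_loop(db_path="tasks.db"):
        conn = sqlite3.connect(db_path)
        while True:
            # A real worker would also mark the row 'in progress' atomically
            # so several workers could run side by side.
            row = conn.execute(
                "SELECT id, payload FROM tasks WHERE status = 'pending' LIMIT 1"
            ).fetchone()
            if row is None:
                time.sleep(1)  # queue is empty; poll again shortly
                continue
            task_id, payload = row
            result = process(payload)
            conn.execute(
                "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                (result, task_id),
            )
            conn.commit()

    if __name__ == "__main__":
        worker_loop()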
+1 to Radomir Dopieralski's answer.
If the task takes long, you should delegate it to a process outside the request-response cycle, either by using a standard cron job or some distributed task queue like Celery.
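For example, a minimal Celery sketch (this uses Celery's current Celery/@app.task API; the broker URL, task body, and file path are placeholders):

    # tasks.py
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker

    @app.task
    def process_backend_file(path):
        with open(path) as f:
            return f.read()  # stand-in for the slow backend processing

    # views.py
    from django.http import HttpResponse
    from tasks import process_backend_file

    def my_django_view(request):
        # Enqueue the slow work and return immediately,
        # instead of tying up a WSGI thread for several seconds.
        process_backend_file.delay("/path/to/backend/file")
        return HttpResponse("Processing started", status=202)

You'd run a worker alongside the web process (e.g. celery -A tasks worker); since the view no longer waits for the result, you'd store or poll for it separately if the client needs it.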
Databases for workload offloading were quite the thing in 2010, and a good idea then, but we've come a bit further now.
We're using Apache Kafka as a queue to store our in-flight workload. So, the dataflow is now:
User -> Apache httpd -> Kafka -> python daemon processor
A user POST operation puts data into the system via a WSGI app that just writes it, very fast, to a Kafka queue. Minimal sanity checking is done in the POST operation to keep it fast, but it still catches some obvious problems. Kafka stores the data very quickly, so the HTTP response is zippy.
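The write side is tiny; roughly this, as a sketch using the kafka-python client (the client choice, view name, topic name, and broker address are my placeholders):

    from django.http import HttpResponse
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def enqueue_view(request):
        # Write the raw request body to Kafka and return at once;
        # Kafka acknowledges quickly, so the HTTP response stays fast.
        producer.send("work-queue", request.body)
        return HttpResponse(status=202)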
A separate set of Python daemons pulls data from Kafka and does the processing. We actually have multiple processes that need to process it differently, but Kafka makes that fast by writing once and letting multiple readers read the same data as needed; no penalty is incurred for duplicate storage.
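Each daemon is roughly a consumer loop like this (same kafka-python assumption; every distinct group_id gets its own full copy of the stream, which is how several processors share a single written copy):

    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "work-queue",
        bootstrap_servers="localhost:9092",
        group_id="processor-a",  # a different group_id per kind of processing
    )

    for message in consumer:
        handle(message.value)  # handle() is a placeholder for the real work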
This allows very, very fast turnaround and optimal resource usage, since other boxes off the web tier handle the pull-from-Kafka work, and we can tune that to reduce lag as needed. Kafka is HA, with the same data written to multiple boxes in the cluster, so my manager doesn't complain about the "what happens if" scenarios.
We're quite happy with Kafka. http://kafka.apache.org