Twisted + MapReduce on a single node/server?
I'm confused about Twisted threading.
I've read more than a few articles and books, and sat through a few presentations, on the subject of threads vs. processes in Python. It seems to me that unless one is doing lots of I/O or wants to share memory across jobs, the right choice is multiprocessing.
However, from what I've seen so far, Twisted seems to use threads (pthreads, via the Python threading module), and Twisted seems to perform really well at processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to, using the MapReduce pattern in Python on a single node/server. They don't really do any I/O; they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, Twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, Twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long, in-depth explanation: the GIL allows only one thread at a time to execute Python code. In effect, this means you cannot take advantage of multiple cores with a single multi-threaded Twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the associated Python code is still effectively single-threaded.
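For CPU-bound jobs like the ones in the question, the usual way around the GIL is to spread the map step over processes rather than threads. Below is a minimal sketch using only the standard library's multiprocessing.Pool. The names count_primes, make_chunks and map_reduce are hypothetical, invented for this illustration, and prime counting is just a stand-in for whatever heavy processing the real job does.

```python
from functools import reduce
from multiprocessing import Pool


def count_primes(bounds):
    """Map step: CPU-bound work on one chunk (count primes in [lo, hi))."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        # Trial division; deliberately naive so the work is CPU-heavy.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def make_chunks(limit, size):
    """Split [0, limit) into (lo, hi) ranges of at most `size` numbers."""
    return [(lo, min(lo + size, limit)) for lo in range(0, limit, size)]


def map_reduce(chunks, map_impl=map):
    """Apply count_primes to every chunk, then sum the partial counts.

    map_impl defaults to the serial built-in map; pass pool.map to run
    the map step in separate processes, each with its own GIL.
    """
    partials = map_impl(count_primes, chunks)
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    chunks = make_chunks(200_000, 25_000)
    with Pool() as pool:  # one worker process per CPU core by default
        total = map_reduce(chunks, map_impl=pool.map)
    print(total)
```

Because the map implementation is passed in, the same function can be run serially for debugging and with pool.map for real work. The worker function has to live at module top level so it can be pickled and sent to the worker processes.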
Twisted's threading is mainly useful alongside modules built on blocking I/O. A prime example is database APIs: the DB-API spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. So to use PostgreSQL from a Twisted app, for example, one has to either block or use something like twisted.enterprise.adbapi, which is a wrapper that uses twisted.internet.threads.deferToThread to let a SQL query execute while other things are going on. This lets other Python code run because the socket module (like most others involving operating system I/O) releases the GIL while in a system call.
That said, you can use Twisted to write a network application that talks to many Twisted (or non-Twisted, if you'd like) workers. Each worker can then work on small pieces of the job, and you are not restricted by the GIL, because each worker is its own completely isolated process. The master process can then make use of many of Twisted's asynchronous primitives. For example, you could use a DeferredList to wait on results coming from any number of workers, and then run a response handler when all of the Deferreds complete (thus allowing you to do your map call). If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Messaging Protocol and can be used very trivially to implement network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that [items 1) and 2) are missing from the source]. On modern systems, though, 2) is rarely a problem unless you are running hundreds of subprocesses, and problem 1) can be solved by using a process management system like supervisord.
Edit: For more on Python and the GIL, you should watch Dave Beazley's talks on the subject (website, video, slides).