Parallelizing my Python program
I have a Python program that reads a line from an input file, does some manipulation, and writes it to an output file. I have a quad-core machine, and I want to utilize all of the cores. I think there are a few alternatives for doing this:
- Creating n Python processes, each handling (total number of records)/n records
- Creating n threads in a single Python process for every input record, with each thread processing one record
- Creating a pool of n threads in a single Python process, each processing one input record
I have never used Python's multiprocessing capabilities. Can the hackers here please tell me which method is the best option?
Comments (3)
The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python, unless your heavy lifting gets done in C extensions that release the GIL.
The simplest way to overcome this limitation is to use the multiprocessing module instead. It has a similar API to threading and is pretty straightforward to use. In your case, you could use it for the manipulation step (assuming that the manipulation is the hard part).
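A minimal sketch of how multiprocessing might be applied to this kind of line-by-line job. The names manipulate and process_file, the upper-casing placeholder, and the chunksize of 1000 are assumptions made for illustration, not taken from the question:

```python
import multiprocessing
import sys


def manipulate(line):
    """Placeholder for the real per-line work (hypothetical: upper-cases the line)."""
    return line.upper()


def process_file(in_path, out_path, workers=None):
    """Apply manipulate() to every line of in_path using a pool of processes."""
    # workers=None lets Pool start one process per CPU core (os.cpu_count()).
    with multiprocessing.Pool(processes=workers) as pool, \
            open(in_path) as src, open(out_path, "w") as dst:
        # imap yields results in input order; chunksize batches lines so
        # that less time is spent on inter-process communication.
        for result in pool.imap(manipulate, src, chunksize=1000):
            dst.write(result)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # when they import the module.
    if len(sys.argv) == 3:
        process_file(sys.argv[1], sys.argv[2])
    else:
        print("usage: python parallel_lines.py INPUT OUTPUT")
```

In this sketch only the manipulation runs in parallel. Reading and writing stay in the parent process, so it helps only if the manipulation dominates the run time.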
Number one is the right answer.
First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like pyro to take care of the details. Secondly, threading needs to deal with Python's global interpreter lock, which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways to do things, your intuition as to which way is fastest is probably wrong.
By the way, if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython instead of CPython.
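One way to measure rather than guess is to time the same work under a thread pool and a process pool. This is a rough sketch: cpu_bound is a hypothetical stand-in for the real per-record work, and the workload sizes are arbitrary.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Stand-in for CPU-heavy per-record work (hypothetical)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(executor_cls, inputs, workers=4):
    """Return (elapsed_seconds, results) for mapping cpu_bound over inputs."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(cpu_bound, inputs))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    inputs = [200_000] * 40  # arbitrary workload; tune to resemble real records
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        elapsed, _ = timed(cls, inputs)
        print(f"{cls.__name__}: {elapsed:.2f}s")
```

The numbers depend on the machine, the interpreter, and how closely cpu_bound resembles the real manipulation, so it is worth swapping in the actual function before drawing conclusions.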
Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?
While CPython has the GIL, neither Jython nor IronPython has that limitation.
Also make sure that a simple single process doesn't already max out disk I/O. If it does, you will have a hard time gaining anything.
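On the point about splitting the file beforehand, here is a minimal sketch that cuts an input file into n part files with roughly equal line counts, so that each process can work on its own file. The function name split_file and the part0.txt, part1.txt, ... naming scheme are assumptions for illustration:

```python
import os


def split_file(in_path, n, out_dir):
    """Split in_path into n part files in out_dir; return the part paths in order."""
    # First pass: count lines so the parts can be contiguous and balanced.
    with open(in_path) as src:
        total = sum(1 for _ in src)
    per_part = -(-total // n)  # ceiling division

    paths = []
    # Second pass: copy per_part lines into each part file in turn.
    with open(in_path) as src:
        for part in range(n):
            path = os.path.join(out_dir, f"part{part}.txt")
            with open(path, "w") as dst:
                for _ in range(per_part):
                    line = src.readline()
                    if not line:  # input exhausted; later parts stay empty
                        break
                    dst.write(line)
            paths.append(path)
    return paths
```

Because the parts are contiguous, concatenating the per-part outputs in order reproduces the original record order. The split reads the whole file twice, so it only pays off when the manipulation, not disk I/O, is the bottleneck.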