Parallelizing my Python program

Published 2024-10-27 15:18:11


I have a Python program that reads a line from an input file, does some manipulation, and writes it to an output file. I have a quad-core machine, and I want to utilize all of its cores. I can think of three alternatives:

  1. Creating n Python processes, each handling (total number of records)/n records
  2. Creating a thread per input record in a single Python process, each thread processing one record
  3. Creating a pool of n threads in a single Python process, each executing one input record at a time

I have never used Python's multiprocessing capabilities. Can anyone tell me which of these options is best?


Comments (3)

忘你却要生生世世 2024-11-03 15:18:11


The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.

The simplest way to overcome this limitation is to use the multiprocessing module instead. It has an API similar to threading and is pretty straightforward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):

import multiprocessing

def process_line(line):
    # This function is executed in your worker processes.  Manipulate the
    # line and return the result.  (manipulate() stands for whatever your
    # program currently does to each line.)
    return manipulate(line)

if __name__ == '__main__':
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        # This creates a pool of N worker processes, where N is the number
        # of CPUs in your machine; the with-block also shuts the pool down
        # cleanly when you are done.
        with multiprocessing.Pool() as pool:
            # Let the workers do the manipulation and write the results to
            # the output file.  imap yields results in input order.
            for manipulated_line in pool.imap(process_line, fin):
                fout.write(manipulated_line)
风和你 2024-11-03 15:18:11


Number one is the right answer.

First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like Pyro to take care of the details. Secondly, threading has to deal with Python's global interpreter lock, which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways of doing things, your intuition about which way is fastest is probably wrong.

By the way, if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython rather than CPython.
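The "n processes, each handling records/n" approach from option 1 could be sketched with explicit Process objects, as below. This is an illustrative sketch only: the uppercase transform stands in for the real manipulation, and the queue/index bookkeeping is one of several ways to collect results in order.

```python
import multiprocessing

def worker(chunk, queue, index):
    # Process one slice of the records and report the results back,
    # tagged with the slice index so the parent can restore order.
    queue.put((index, [line.upper() for line in chunk]))

def process_in_chunks(records, n):
    # Split the records into n roughly equal slices, one per process.
    size = (len(records) + n - 1) // n
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(c, queue, i))
             for i, c in enumerate(chunks)]
    for p in procs:
        p.start()
    # Drain the queue before joining, then reassemble in slice order
    # (results arrive in completion order, not submission order).
    parts = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return [line for i in sorted(parts) for line in parts[i]]

if __name__ == '__main__':
    print(process_in_chunks(['a\n', 'b\n', 'c\n', 'd\n'], 2))
```

Compared with this manual version, multiprocessing.Pool does the same splitting and collecting for you, which is why the first answer reaches for it.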

诗化ㄋ丶相逢 2024-11-03 15:18:11


Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?
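Splitting the file on line boundaries beforehand could look like the sketch below; `split_file` and the `.partN` naming are illustrative choices, not from the original answer:

```python
def split_file(path, n):
    # Split a text file into up to n chunk files on line boundaries,
    # so that each worker process can read its own chunk independently.
    with open(path) as f:
        lines = f.readlines()
    size = (len(lines) + n - 1) // n
    names = []
    for i in range(0, len(lines), size):
        name = '%s.part%d' % (path, i // size)
        with open(name, 'w') as out:
            out.writelines(lines[i:i + size])
        names.append(name)
    return names
```

Each worker then opens one `.partN` file on its own. For very large files you would stream instead of calling readlines(), but that would make the sketch longer.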

While CPython has the GIL, neither Jython nor IronPython has that limitation.

Also make sure that a simple single process doesn't already max out disk I/O. If it does, you will have a hard time gaining anything from parallelization.
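Measuring the single-process baseline before parallelizing could look like this sketch; `process_line` is a placeholder for the real per-record manipulation:

```python
import time

def process_line(line):
    # Placeholder for the real manipulation.
    return line.upper()

def measure_baseline(lines):
    # Time a plain single-process pass over the records.  If this pass
    # is already disk-bound, adding processes will not help much.
    start = time.perf_counter()
    out = [process_line(line) for line in lines]
    elapsed = time.perf_counter() - start
    return out, elapsed

if __name__ == '__main__':
    data = ['record %d\n' % i for i in range(100000)]
    results, seconds = measure_baseline(data)
    print('%d lines in %.3f s' % (len(results), seconds))
```

Comparing this number against the parallel version tells you whether the extra processes are actually buying anything.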
