Parallelizing my Python program

Published 2024-10-27 15:18:11


I have a Python program that reads a line from an input file, does some manipulation, and writes it to an output file. I have a quad-core machine, and I want to utilize all of its cores. I can think of three alternatives:

  1. Creating n Python processes, each handling (total number of records)/n records
  2. Creating a thread per input record in a single Python process, each thread processing one record
  3. Creating a pool of n threads in a single Python process, each executing one input record at a time

I have never used Python's multiprocessing capabilities. Can anyone tell me which of these options is best?


Comments (3)

忘你却要生生世世 2024-11-03 15:18:11


The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.

The simplest way to overcome this limitation is to use the multiprocessing module instead. It has an API similar to threading and is pretty straightforward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):

import multiprocessing

def process_line(line):
    # This function is executed in your worker processes.  Manipulate the
    # line and return the result.  (manipulate() stands for whatever your
    # program currently does to each line.)
    return manipulate(line)

if __name__ == '__main__':
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        # This creates a pool of N worker processes, where N is the number
        # of CPUs in your machine; the with-block also shuts the pool down
        # cleanly when you are done.
        with multiprocessing.Pool() as pool:
            # Let the workers do the manipulation and write the results to
            # the output file.  imap yields results in input order.
            for manipulated_line in pool.imap(process_line, fin):
                fout.write(manipulated_line)
风和你 2024-11-03 15:18:11


Number one is the right answer.

First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like Pyro to take care of the details. Secondly, threading has to deal with Python's global interpreter lock, which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways of doing things, your intuition about which way is fastest is probably wrong.

By the way, if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython rather than CPython.
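The "n processes, each handling records/n" approach from option 1 could be sketched with explicit Process objects, as below. This is an illustrative sketch only: the uppercase transform stands in for the real manipulation, and the queue/index bookkeeping is one of several ways to collect results in order.

```python
import multiprocessing

def worker(chunk, queue, index):
    # Process one slice of the records and report the results back,
    # tagged with the slice index so the parent can restore order.
    queue.put((index, [line.upper() for line in chunk]))

def process_in_chunks(records, n):
    # Split the records into n roughly equal slices, one per process.
    size = (len(records) + n - 1) // n
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(c, queue, i))
             for i, c in enumerate(chunks)]
    for p in procs:
        p.start()
    # Drain the queue before joining, then reassemble in slice order
    # (results arrive in completion order, not submission order).
    parts = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return [line for i in sorted(parts) for line in parts[i]]

if __name__ == '__main__':
    print(process_in_chunks(['a\n', 'b\n', 'c\n', 'd\n'], 2))
```

Compared with this manual version, multiprocessing.Pool does the same splitting and collecting for you, which is why the first answer reaches for it.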

诗化ㄋ丶相逢 2024-11-03 15:18:11


Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?
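Splitting the file on line boundaries beforehand could look like the sketch below; `split_file` and the `.partN` naming are illustrative choices, not from the original answer:

```python
def split_file(path, n):
    # Split a text file into up to n chunk files on line boundaries,
    # so that each worker process can read its own chunk independently.
    with open(path) as f:
        lines = f.readlines()
    size = (len(lines) + n - 1) // n
    names = []
    for i in range(0, len(lines), size):
        name = '%s.part%d' % (path, i // size)
        with open(name, 'w') as out:
            out.writelines(lines[i:i + size])
        names.append(name)
    return names
```

Each worker then opens one `.partN` file on its own. For very large files you would stream instead of calling readlines(), but that would make the sketch longer.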

While CPython has the GIL, neither Jython nor IronPython has that limitation.

Also make sure that a simple single process doesn't already max out disk I/O. If it does, you will have a hard time gaining anything from parallelization.
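Measuring the single-process baseline before parallelizing could look like this sketch; `process_line` is a placeholder for the real per-record manipulation:

```python
import time

def process_line(line):
    # Placeholder for the real manipulation.
    return line.upper()

def measure_baseline(lines):
    # Time a plain single-process pass over the records.  If this pass
    # is already disk-bound, adding processes will not help much.
    start = time.perf_counter()
    out = [process_line(line) for line in lines]
    elapsed = time.perf_counter() - start
    return out, elapsed

if __name__ == '__main__':
    data = ['record %d\n' % i for i in range(100000)]
    results, seconds = measure_baseline(data)
    print('%d lines in %.3f s' % (len(results), seconds))
```

Comparing this number against the parallel version tells you whether the extra processes are actually buying anything.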
