Multiprocessing or threading in Python?

Posted 2024-07-30 18:23:25


I have a Python application that grabs a collection of data and performs a task on each piece of data in that collection. The task takes some time to complete because there is a delay involved. Because of this delay, I don't want the tasks to run one after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?

I attempted to use threading but ran into some trouble: often some of the tasks would never actually fire.

Comments (8)

琉璃梦幻 2024-08-06 18:23:25


If you are truly compute bound, using the multiprocessing module is probably the lightest-weight solution (in terms of both memory consumption and implementation difficulty).

If you are I/O bound, using the threading module will usually give you good results. Make sure you use thread-safe storage (such as queue.Queue) to hand data to your threads, or else hand each one a single piece of data that is unique to it when it is spawned.
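
For example, a minimal sketch of that Queue-based hand-off (do_task and collection here are hypothetical stand-ins for the question's slow task and data set):

import threading
import queue

def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:               # sentinel value tells this worker to exit
            break
        out_q.put(do_task(item))       # do_task: the hypothetical slow, I/O-bound task

in_q, out_q = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(4)]
for t in threads:
    t.start()
for item in collection:               # 'collection' stands in for your data set
    in_q.put(item)
for _ in threads:
    in_q.put(None)                    # one sentinel per worker thread
for t in threads:
    t.join()
results = [out_q.get() for _ in range(out_q.qsize())]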

PyPy is focused on performance. It has a number of features that can help with compute-bound processing. It also has support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements).

Stackless Python is also a nice idea. Stackless has portability issues, as discussed in another answer below. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It takes a different approach from PyPy's, which may yield better (or just different) speedups.

温柔嚣张 2024-08-06 18:23:25


Threads only run one at a time (interleaved), but they give you the illusion of running in parallel. They are a good choice when the tasks involve file or connection I/O, because they are lightweight.

Multiprocessing with a Pool may be the right solution for you, because processes run truly in parallel, which makes them very good for intensive computing: each process runs on its own CPU (or core).

Setting up multiprocessing can be very easy:

from multiprocessing import Pool

def worker(input_item):
    output = do_some_work(input_item)  # pass each item to your work function
    return output

if __name__ == '__main__':  # required on platforms that spawn processes, e.g. Windows
    pool = Pool()  # one process per CPU (or core) by default; use Pool(4) to force 4 processes
    list_of_results = pool.map(worker, input_list)  # launches all work and collects results in order
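
Note that pool.map blocks until every result is ready and returns the results in input order; if you would rather consume results as they complete, pool.imap_unordered takes the same arguments but yields results in completion order.
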
花想c 2024-08-06 18:23:25


For small collections of data, simply create subprocesses with subprocess.Popen.

Each subprocess can get its piece of data from stdin or from command-line arguments, do its processing, and write the result to an output file.

When the subprocesses have all finished (or timed out), you simply merge the output files.

Very simple.
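
A rough sketch of that pattern (worker.py and the file names here are hypothetical; assume worker.py reads one command-line argument, processes it, and prints its result to stdout):

import subprocess

items = ['alpha', 'beta', 'gamma']                # the collection of data
procs = []
for i, item in enumerate(items):
    out = open('out_%d.txt' % i, 'w')
    # each child gets its piece of data as a command-line argument
    p = subprocess.Popen(['python', 'worker.py', item], stdout=out)
    procs.append((p, out))

for p, out in procs:
    p.wait()                                      # or p.wait(timeout=...) to enforce a limit
    out.close()

with open('merged.txt', 'w') as merged:           # merge the per-process output files
    for i in range(len(items)):
        with open('out_%d.txt' % i) as f:
            merged.write(f.read())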

苏大泽ㄣ 2024-08-06 18:23:25


You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (each one yields control to the next coroutine), or else you can set Stackless up for preemptive multitasking.

In Stackless, you don't have threads, but tasklets or greenlets, which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup needed to get multitasking going.

However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
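
For what it's worth, the cooperative version looks roughly like this (a sketch that runs only on the Stackless interpreter, not stock CPython; do_slice_of_work is a hypothetical placeholder):

import stackless  # only available in the Stackless Python interpreter

def long_task(name):
    for i in range(3):
        do_slice_of_work(name, i)   # hypothetical: one slice of the long-running work
        stackless.schedule()        # yield control to the next runnable tasklet

stackless.tasklet(long_task)('a')   # create two cooperating tasklets
stackless.tasklet(long_task)('b')
stackless.run()                     # run the scheduler until all tasklets finish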

傾城如夢未必闌珊 2024-08-06 18:23:25


Using CPython's threading model will not give you any performance improvement for CPU-bound work, because the Global Interpreter Lock (GIL) prevents threads from executing Python bytecode in parallel (the GIL exists largely to protect CPython's reference-counting memory management). Multiprocessing does allow parallel execution. Obviously, in this case you have to have multiple cores available to farm your parallel jobs out to.

There is much more information available in this related question.
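
You can see this for yourself with a small experiment: run the same CPU-bound map on a thread pool and on a process pool (multiprocessing.dummy provides a thread-backed Pool with the same API). On CPython, only the process version should show a real speedup:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def cpu_bound(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8
    for label, make_pool in (('threads', ThreadPool), ('processes', Pool)):
        start = time.time()
        with make_pool(4) as pool:
            pool.map(cpu_bound, work)
        print('%s: %.2fs' % (label, time.time() - start))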

又怨 2024-08-06 18:23:25


If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally and feed the pieces to several processes of your program (i.e. several processes instead of threads).
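
A minimal sketch of that idea (handle_item is hypothetical): split the data into one chunk per process and let each process work through its chunk independently:

from multiprocessing import Process

def handle_chunk(chunk):
    for item in chunk:
        handle_item(item)                         # hypothetical per-item task

if __name__ == '__main__':
    data = list(range(100))                       # stand-in for your collection
    n = 4
    chunks = [data[i::n] for i in range(n)]       # round-robin partition into n slices
    procs = [Process(target=handle_chunk, args=(chunk,)) for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()                                  # wait for every process to finish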

深居我梦 2024-08-06 18:23:25


IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing, it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.

To the guy who recommends Stackless Python: I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is not actually parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.

我不是你的备胎 2024-08-06 18:23:25


You may want to look at Twisted. It is designed for asynchronous network tasks.
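
A tiny sketch of the Twisted style, assuming the per-item latency is network-like (the delay here is simulated with reactor.callLater; real code would make an actual network call):

from twisted.internet import defer, reactor

def process_item(item):
    d = defer.Deferred()
    # simulate a task with latency; fire the deferred with a result after 1s
    reactor.callLater(1.0, d.callback, item * 2)
    return d

def done(results):
    print(results)          # all three items finish after ~1s total, not ~3s
    reactor.stop()

defer.gatherResults([process_item(i) for i in [1, 2, 3]]).addCallback(done)
reactor.run()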
