Defining the appropriate number of processes

Published 2024-11-25 16:30:14


I have Python code that processes a lot of Apache logs (decompressing, parsing, number crunching, regexing, etc.). One parent process takes a list of files (up to a few million) and sends batches of files to workers to parse, using a multiprocessing pool.

I wonder if there are any guidelines / benchmarks / advice that can help me estimate the ideal number of child processes. I.e., is having one process per core better than launching a few hundred of them?

Currently about 3/4 of the script's execution time is spent reading files and decompressing them, and in terms of resources it is the CPU that is 100% loaded, with memory and I/O being fine. So I assume a lot can be gained with proper multiprocessing settings. The script will run on different machines / OSes, so OS-specific hints are welcome too.

Also, is there any benefit to using threads rather than multiple processes?
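For reference, a minimal sketch of the setup the question describes: a parent process fanning a file list out to a `multiprocessing.Pool` of workers. The worker function `parse_file` and the regex are illustrative stand-ins, not the asker's actual code.

```python
import gzip
import os
import re
import tempfile
from multiprocessing import Pool, cpu_count

# Illustrative pattern; the real parse/crunch step would be heavier.
LOG_PATTERN = re.compile(rb'GET ')

def parse_file(path):
    """Decompress one gzipped log file and count lines matching the pattern."""
    hits = 0
    with gzip.open(path, 'rb') as f:
        for line in f:
            if LOG_PATTERN.search(line):
                hits += 1
    return path, hits

def process_logs(paths, workers=None):
    # Default to one worker per core; a chunksize > 1 keeps the
    # parent-to-worker IPC overhead low when the file list is huge.
    workers = workers or cpu_count()
    with Pool(processes=workers) as pool:
        return pool.map(parse_file, paths, chunksize=100)

if __name__ == '__main__':
    # Demo with one small synthetic gzipped log file.
    fd, path = tempfile.mkstemp(suffix='.gz')
    os.close(fd)
    with gzip.open(path, 'wb') as f:
        f.write(b'GET /a HTTP/1.1\nPOST /b HTTP/1.1\nGET /c HTTP/1.1\n')
    print(process_logs([path]))
    os.unlink(path)
```

The `workers` parameter is the knob the question is really about; everything else stays fixed while it varies.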


Comments (4)

七七 2024-12-02 16:30:14


I wonder if there are any guidelines / benchmarks / advice that can help me estimate the ideal number of child processes?

No.

Is having one process per core better than launching a few hundred of them?

You can never know in advance.

There are too many degrees of freedom.

You can only discover it empirically by running experiments until you get the level of performance you desire.

Also, is there any benefit to using threads rather than multiple processes?

Rarely.

Threads don't help much. Multiple threads doing I/O will be locked up waiting while the process (as a whole) waits for the O/S to finish the I/O request.

Your operating system does a very, very good job of scheduling processes. When you have I/O intensive operations, you really want multiple processes.
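The "discover it empirically" advice above can be put into a small harness: fix the workload, sweep the pool size, and compare wall-clock times. The CPU-bound `work` function is a stand-in for the real decompress/parse step.

```python
import time
from multiprocessing import Pool, cpu_count

def work(n):
    # Stand-in for the real per-file work: pure CPU load.
    total = 0
    for i in range(n):
        total += i * i
    return total

def benchmark(workers, tasks=64, size=200_000):
    """Run the same fixed workload under a given pool size; return seconds."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(work, [size] * tasks)
    return time.perf_counter() - start

if __name__ == '__main__':
    # Sweep candidate pool sizes and report wall-clock time for each.
    for w in (1, 2, cpu_count(), 2 * cpu_count()):
        print(f'{w:3d} workers: {benchmark(w):.2f}s')
```

Run against the real data rather than the synthetic `work` function for numbers that actually transfer; the sweet spot typically sits near the core count for CPU-bound work.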

落叶缤纷 2024-12-02 16:30:14


Multiple cores do not provide better performance if the program is I/O bound. The performance might even become worse if the disk is serving two or more masters.

若相惜即相离 2024-12-02 16:30:14


I'm not sure if current OSes do this, but it used to be that I/O buffers were allocated per-process, so dividing one process' buffer among multiple threads would lead to buffer thrashing. You're far better off using multiple processes for I/O-heavy tasks.

白色秋天 2024-12-02 16:30:14


I'll address the last question first. In CPython, it is next to impossible to make sizeable performance gains by distributing CPU-bound load across threads. This is due to the Global Interpreter Lock. In that respect multiprocessing is a better bet.

As to estimating the ideal number of workers, here is my advice: run some experiments with your code, your data, your hardware and a varying number of workers, and see what you can glean from that in terms of speedups, bottlenecks etc.
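The GIL effect this answer describes can be observed directly. This sketch times the same CPU-bound function under a thread pool and a process pool; on a multi-core machine the process pool should finish noticeably faster (exact timings are machine-dependent).

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def crunch(n):
    # CPU-bound loop: holds the GIL, so threads cannot run it in parallel.
    total = 0
    for i in range(n):
        total += i % 7
    return total

def timed(executor_cls, jobs=4, size=2_000_000):
    """Run `jobs` copies of crunch under the given executor; return seconds."""
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        list(ex.map(crunch, [size] * jobs))
    return time.perf_counter() - start

if __name__ == '__main__':
    # Processes sidestep the GIL for CPU-bound work; threads do not.
    print(f'threads:   {timed(ThreadPoolExecutor):.2f}s')
    print(f'processes: {timed(ProcessPoolExecutor):.2f}s')
```

For I/O-heavy phases (reading and decompressing), the gap narrows, since CPython releases the GIL during blocking I/O; measuring both phases separately is worthwhile.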
