Python 3: parallelizing a job with multiprocessing

Posted 2024-12-05 02:24:57


I have a script that parses a file containing directories to other files, which have to be opened and read in search of a keyword.
Since the number of files is growing, I'd like to use multiprocessing to reduce the time it takes to complete the job.

I was thinking of leaving the parent process to parse the file containing directories and using child processes to fetch the other files. Since the parent would need to obtain the data before creating the children, that would be a blocking architecture (the parent has to read the whole file before calling the children), whereas I'd like to send the list of directories to one of the children every 100 results.

So, the parent continues parsing the file while the children work at the same time to find the keyword.

How could I do something like that?
If you need more explanation, please ask and I'll tell you more.

Thanks.


凉宸 answered on 2024-12-12 02:24:57:


I was thinking of leaving the parent process to parse the file containing directories and using child processes to fetch the other files.

A directory is a name. The parent parses a list and provides the directory name to each child. Right? The child then reads the files inside the directory.

Since the parent would need to obtain the data before creating the children, that would be a blocking architecture (the parent has to read the whole file before calling the children),

Um. The child doesn't read the files inside the directory? Up above, it says the child does read the files. It's silly for the parent to read a lot of data and push that to the children.

whereas I'd like to send the list of directories to one of the children every 100 results.

Well. This is different. Now you want the parent to read a directory name, read a batch of 100 file names, and send the file names to a child. Okay. That's less silly than reading all the data. Now it's just 100 names.
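For the batching step itself, a small helper is enough. A minimal sketch (the function name `batched` and the iterable `names` are illustrative, not from the original post):

```python
from itertools import islice

def batched(names, size=100):
    """Yield tuples of at most `size` items from an iterable of names."""
    it = iter(names)
    while True:
        batch = tuple(islice(it, size))
        if not batch:
            return          # the iterable is exhausted
        yield batch
```

Each tuple this yields is one unit of work to hand to a child; the last batch may simply be shorter than 100.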

So, the parent continues parsing the file while the children work at the same time to find the keyword.

Okay. But you're totally missing the opportunity for parallel processing.

Read the multiprocessing module carefully.

What you want are two queues and two kinds of workers.

Your application will build the two queues. It will build a source Process, a pool of "get batch" worker processes, and a pool of "get files" worker processes.

  • Source. This process is (basically) a function that reads the original "file containing directories" and puts each directory name into the "get batch" queue.

  • Get Batch. This is a pool of processes. Each process is a function that gets an entry from the "get batch" queue. This is a directory name. It then reads the directory and enqueues tuples of up to 100 file names into the "get files" queue.

  • Get Files. This is a pool of processes. Each process is a function that gets an entry from the "get files" queue. This is a tuple of up to 100 file names. It then opens and reads these files, doing god-knows-what with them.

The idea of the multiprocessing module is to use pools of workers that all get their tasks from a queue and put their results into another queue. These workers all run at the same time.
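A minimal sketch of that layout follows, assuming the directory list lives in a file named directories.txt, the keyword is fixed, and a None sentinel shuts each pool down; the file name, keyword, pool sizes, and sentinel convention are all illustrative assumptions, not part of the original question.

```python
import os
from itertools import islice
from multiprocessing import Process, Queue

DIRS_FILE = "directories.txt"   # assumed input: one directory per line
KEYWORD = "needle"              # assumed keyword to search for
BATCH_SIZE = 100
N_BATCH_WORKERS = 2             # pool sizes are arbitrary choices
N_FILE_WORKERS = 4
SENTINEL = None                 # marks end-of-work on a queue

def source(dir_queue):
    # Source: read the original "file containing directories" and put
    # each directory name into the "get batch" queue.
    with open(DIRS_FILE) as f:
        for line in f:
            name = line.strip()
            if name:
                dir_queue.put(name)

def get_batch(dir_queue, file_queue):
    # Get Batch: take a directory name, list it, and enqueue tuples of
    # up to 100 file names into the "get files" queue.
    for dirname in iter(dir_queue.get, SENTINEL):
        names = iter(os.path.join(dirname, n) for n in os.listdir(dirname))
        while True:
            batch = tuple(islice(names, BATCH_SIZE))
            if not batch:
                break
            file_queue.put(batch)

def get_files(file_queue):
    # Get Files: take a tuple of file names, open and read each one,
    # and report the files that contain the keyword.
    for batch in iter(file_queue.get, SENTINEL):
        for path in batch:
            with open(path, errors="replace") as f:
                if any(KEYWORD in line for line in f):
                    print("match:", path)

if __name__ == "__main__":
    dir_queue, file_queue = Queue(), Queue()
    src = Process(target=source, args=(dir_queue,))
    batchers = [Process(target=get_batch, args=(dir_queue, file_queue))
                for _ in range(N_BATCH_WORKERS)]
    readers = [Process(target=get_files, args=(file_queue,))
               for _ in range(N_FILE_WORKERS)]
    for p in (src, *batchers, *readers):
        p.start()

    src.join()                      # no more directory names will arrive
    for _ in batchers:              # one sentinel per "get batch" worker
        dir_queue.put(SENTINEL)
    for p in batchers:
        p.join()
    for _ in readers:               # one sentinel per "get files" worker
        file_queue.put(SENTINEL)
    for p in readers:
        p.join()
```

The parent only moves names between queues; all of the heavy I/O happens concurrently in the two pools. Shutdown is ordered: the source is joined first, then one sentinel per worker is pushed into each queue so every process exits cleanly.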
