Parallelizing a job with multiprocessing in Python 3
I have a script that parses a file containing directories to other files, which have to be opened and read to search for a keyword.
Since the number of files is growing, I'd like to use multiprocessing to reduce the time needed to complete the job.
I was thinking of leaving the parent process to parse the file containing the directories and using child processes to fetch the other files. Since the parent would need to obtain the data before creating the children, it would be a blocking architecture (the parent has to read the whole file before calling the children), whereas I'd like to send the list of directories to one of the children every 100 results.
That way, the parent continues parsing the file while the children work at the same time to find the keyword.
How could I do something like that?
If you need more explanations, please, ask me and I'll tell you more.
Thanks.
A directory is a name. The parent parses a list and provides the directory name to each child. Right? The child then reads the files inside the directory.
Um. The child doesn't read the files inside the directory? Up above, it says the child does read the files. It's silly for the parent to read a lot of data and push that to the children.
Well. This is different. Now you want to have the parent read a directory name, read a batch of 100 file names and send the file names to the child. Okay. That's less silly than reading all the data. Now it's just 100 names.
Okay. But you're totally missing the opportunity for parallel processing.
Read the multiprocessing module documentation carefully. What you want are two queues and two kinds of workers.
Your application will build the two queues. It will build a source Process, a pool of "get batch" worker processes, and a pool of "get files" worker processes.
Source. This process is (basically) a function that reads the original "file containing directories" and puts each directory name into the "get batch" queue.
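A minimal sketch of the source, assuming the file lists one directory name per line (the file layout and the parameter names are assumptions):

    def source(dir_list_path, batch_queue):
        # Read the original "file containing directories" and put each
        # directory name into the "get batch" queue as soon as it is
        # parsed, so the workers start before parsing is finished.
        with open(dir_list_path) as f:
            for line in f:
                name = line.strip()
                if name:
                    batch_queue.put(name)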
Get Batch. This is a pool of processes. Each process is a function that gets an entry from the "get batch" queue. This is a directory name. It then reads the directory and enqueues a tuple of 100 file names into the "get files" queue.
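A sketch of one "get batch" worker. The None sentinel used for shutdown and the batch size of 100 are assumptions taken from the question, not requirements of the module:

    import os

    BATCH_SIZE = 100  # the "100 results" from the question

    def get_batch_worker(batch_queue, files_queue):
        # Each entry taken from batch_queue is a directory name; each
        # entry put on files_queue is a tuple of up to 100 file paths.
        while True:
            dirname = batch_queue.get()
            if dirname is None:        # sentinel: no more directories
                break
            paths = [os.path.join(dirname, n) for n in os.listdir(dirname)]
            for i in range(0, len(paths), BATCH_SIZE):
                files_queue.put(tuple(paths[i:i + BATCH_SIZE]))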
Get Files. This is a pool of processes. Each process is a function that gets an entry from the "get files" queue. This is a tuple of 100 files. It then opens and reads these 100 files, doing god-knows-what with them.
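A sketch of one "get files" worker; the plain substring test and the result queue are illustrative assumptions standing in for the "god-knows-what" part:

    def get_files_worker(files_queue, keyword, result_queue):
        # Each entry taken from files_queue is a tuple of file paths.
        # Open and read each file, reporting paths containing the keyword.
        while True:
            batch = files_queue.get()
            if batch is None:          # sentinel: no more batches
                break
            for path in batch:
                with open(path, errors="ignore") as f:
                    if keyword in f.read():
                        result_queue.put(path)
        result_queue.put(None)         # tell the parent this worker is done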
The idea of the multiprocessing module is to use pools of workers that all get their tasks from a queue and put their results into another queue. These workers all run at the same time.
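Putting it together, here is one way to wire up the two queues and the two pools, reusing the worker functions sketched above (the pool sizes, the keyword, and the file name are made up):

    import multiprocessing

    def main():
        batch_queue = multiprocessing.Queue()   # directory names
        files_queue = multiprocessing.Queue()   # tuples of file paths
        result_queue = multiprocessing.Queue()  # matching file paths

        batch_pool = [multiprocessing.Process(target=get_batch_worker,
                                              args=(batch_queue, files_queue))
                      for _ in range(2)]
        file_pool = [multiprocessing.Process(target=get_files_worker,
                                             args=(files_queue, "keyword",
                                                   result_queue))
                     for _ in range(4)]
        for p in batch_pool + file_pool:
            p.start()

        # The source runs in the parent and overlaps with the workers:
        # they consume the queues while the parent is still filling them.
        source("directories.txt", batch_queue)

        for _ in batch_pool:               # one sentinel per batch worker
            batch_queue.put(None)
        for p in batch_pool:
            p.join()
        for _ in file_pool:                # one sentinel per file worker
            files_queue.put(None)

        # Drain results as they arrive; each file worker sends one final
        # None when it finishes, so count sentinels before joining.
        done = 0
        while done < len(file_pool):
            item = result_queue.get()
            if item is None:
                done += 1
            else:
                print(item)
        for p in file_pool:
            p.join()

    if __name__ == "__main__":
        main()

The sentinels keep the shutdown ordered: the batch workers are drained and joined before the file workers are told to stop, so no batch of file names is lost.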