Parallel Foreach memory issue
I have a file collection (3,000 files) in a FileInfoCollection. I want to process all the files by applying some logic which is independent per file (and so can be executed in parallel).
FileInfo[] fileInfoCollection = directory.GetFiles();
Parallel.ForEach(fileInfoCollection, ProcessWorkerItem);
But after processing about 700 files I get an out-of-memory error. I used a thread pool before, but it gave the same error.
If I execute without threading (no parallel processing), it works fine.
In "ProcessWorkerItem" I am running an algorithm based on the string data of the file. Additionally I use log4net for logging and there are lot of communications with the SQL server in this method.
Here is some more info. File size: 1-2 KB XML files. I read those files, and the processing depends on the content of each file. It identifies some keywords in the string and generates another XML format. The keywords are in the SQL Server database (nearly 2,000 words).
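For illustration only, here is a rough, hypothetical sketch of the per-file work described above. It is not the actual ProcessWorkerItem; everything except that method name (the Worker class, the keywords field, the output path) is made up, and the log4net calls and SQL Server communication are omitted:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class Worker
{
    // assumed to be loaded once from SQL Server at startup, not per file
    private readonly List<string> keywords;

    public Worker(List<string> keywords) { this.keywords = keywords; }

    public void ProcessWorkerItem(FileInfo file)
    {
        // the files are only 1-2 KB, so reading each one whole is cheap
        string content = File.ReadAllText(file.FullName);

        // find which of the ~2,000 keywords occur in this file
        var matches = keywords.Where(k => content.Contains(k)).ToList();

        // generate another XML document from the matches (illustrative shape)
        var output = new XElement("Result",
            matches.Select(m => new XElement("Keyword", m)));
        output.Save(Path.ChangeExtension(file.FullName, ".out.xml"));
    }
}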
3 Answers
Well, what does ProcessWorkerItem do? You may be able to change it to use less memory (e.g. stream the data instead of loading it all in at once), or you may want to explicitly limit the degree of parallelism using the Parallel.ForEach overload that takes a ParallelOptions and setting ParallelOptions.MaxDegreeOfParallelism. Basically you want to avoid trying to process all 3000 files at once :) IIRC, Parallel Extensions will "notice" if your tasks appear to be IO-bound and allow more than the normal number to execute at once - which isn't really what you want here, as you're memory-bound as well.
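As a minimal sketch of that overload (the limit of 4 is just an illustrative value, not a recommendation from the answer; tune it for your machine):

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // illustrative limit
Parallel.ForEach(fileInfoCollection, options, ProcessWorkerItem);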
If you're attempting operations on large files in parallel, then it's feasible that you would run out of available memory.
Maybe consider trying out the Rx extensions and using its Throttle method to control/compose your processing?
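A hedged sketch of one way to bound concurrency with Rx: this uses Merge with a max-concurrency argument rather than Throttle (Throttle is a debouncing operator and would drop items here); the System.Reactive package, the limit of 4, and reusing directory and ProcessWorkerItem from the question are all assumptions for illustration:

using System;
using System.Reactive.Linq;

// one deferred observable per file; Merge(4) subscribes to at most 4 at a time
directory.GetFiles().ToObservable()
    .Select(f => Observable.Defer(() => Observable.Start(() => ProcessWorkerItem(f))))
    .Merge(4)
    .Wait(); // block until the last file has been processed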
I found the bug that caused the memory leak: I was using the Unit of Work pattern with Entity Framework. In the unit of work I keep the context in a hash table, with the thread name as the hash key. When I use threading, the hash table keeps growing, and that caused the memory leak.
So I added an additional method to the unit of work to remove the element from the hash table after a thread's task completes.
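A hypothetical sketch of that kind of fix (the answer's actual unit-of-work code is not shown): MyDbContext stands in for the application's EF context, a ConcurrentDictionary is substituted for the plain hash table for thread safety, and the managed thread id is used as a fallback key because thread-pool threads often have no name:

using System.Collections.Concurrent;
using System.Threading;

// placeholder for the application's Entity Framework DbContext
class MyDbContext : System.IDisposable
{
    public void Dispose() { }
}

static class UnitOfWork
{
    // one context per thread, keyed by thread name (as described in the answer)
    private static readonly ConcurrentDictionary<string, MyDbContext> contexts =
        new ConcurrentDictionary<string, MyDbContext>();

    private static string Key =>
        Thread.CurrentThread.Name ?? Thread.CurrentThread.ManagedThreadId.ToString();

    public static MyDbContext Current => contexts.GetOrAdd(Key, _ => new MyDbContext());

    // the added cleanup method: dispose and drop the context once the thread's task is done,
    // so the dictionary no longer grows without bound
    public static void ReleaseCurrent()
    {
        if (contexts.TryRemove(Key, out var context))
            context.Dispose();
    }
}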