Fast (low-level) method to recursively process files in a folder
My application indexes the contents of all hard drives on end users' computers.
I am using Directory.GetFiles and Directory.GetDirectories to recursively process the whole folder structure. I am indexing only a few selected file types (up to 10 file types).
I can see in the profiler that most of the indexing time is spent enumerating files and folders - depending on the ratio of files that actually get indexed, up to 90 percent of the total time.
I would like to make the indexing as fast as possible. I have already optimized the indexing itself and the processing of the indexed files.
I was thinking of using Win32 API calls, but the profiler shows that most of the processing time is already spent in these API calls as made by .NET.
Is there a (possibly low-level) method accessible from C# that would make the enumeration of files/folders at least partially faster?
As requested in the comments, my current code (just a sketch with the irrelevant parts trimmed):
private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
{
    // for a single extension:
    string[] files = Directory.GetFiles(indexedFolder, extensionFilter);
    foreach (string file in files)
    {
        yield return ProcessFile(file);
    }
    foreach (string directory in Directory.GetDirectories(indexedFolder))
    {
        // recursively process all subdirectories
        foreach (var ie in RecurseFolder(directory))
        {
            yield return ie;
        }
    }
}
2 Answers
In .NET 4.0, there are inbuilt enumerable file listing methods; since that isn't far away, I would try using those. This might be a factor in particular if you have any massively populated folders (requiring a large array allocation).
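For illustration, a minimal sketch of the .NET 4.0 approach, reusing the question's extensionFilter and ProcessFile (the method name RecurseFolder40 is invented for this example); Directory.EnumerateFiles streams results as the tree is walked instead of materializing an array per folder:

private IEnumerable<IndexedEntity> RecurseFolder40(string indexedFolder)
{
    // Streams matches lazily; note that an UnauthorizedAccessException in any
    // subfolder aborts the whole walk, so a manual stack (below) may still be useful.
    foreach (string file in Directory.EnumerateFiles(
        indexedFolder, extensionFilter, SearchOption.AllDirectories))
    {
        yield return ProcessFile(file);
    }
}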
If depth is the issue, I would consider flattening your method to use a local stack/queue and a single iterator block. This will reduce the code path used to enumerate the deep folders:
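A minimal sketch of that flattened approach, assuming the question's single-extension extensionFilter as the search pattern (GetFilesFlat is a name invented here; requires System.Collections.Generic and System.IO; error handling omitted):

private static IEnumerable<string> GetFilesFlat(string root, string searchPattern)
{
    // A local stack replaces the call stack: one iterator block handles any depth.
    Stack<string> pending = new Stack<string>();
    pending.Push(root);
    while (pending.Count != 0)
    {
        string path = pending.Pop();
        foreach (string file in Directory.GetFiles(path, searchPattern))
        {
            yield return file;
        }
        foreach (string subdir in Directory.GetDirectories(path))
        {
            pending.Push(subdir);
        }
    }
}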
Iterate that, creating your ProcessFile entries from the results.
If you believe that the .NET implementation is causing the problem, then I suggest that you use the winapi calls _findfirst, _findnext etc.
It seems to me that .NET requires a lot of memory here because the listing is completely copied into an array at each directory level - so if your directory structure is 10 levels deep, you have 10 versions of the files array in play at any given moment, plus an allocation/deallocation of that array for every directory in the structure.
Using the same recursive technique with _findfirst etc. only requires that a handle to a position in the directory structure be kept at each level of recursion.
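For reference, _findfirst/_findnext are C-runtime wrappers over the Win32 FindFirstFile/FindNextFile APIs, which can be reached from C# via P/Invoke. A rough sketch (the class name, the flattened loop, and the simple EndsWith extension check standing in for the question's extensionFilter are all choices made for this example):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class NativeFileEnumerator
{
    private const int FILE_ATTRIBUTE_DIRECTORY = 0x10;
    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    private struct WIN32_FIND_DATA
    {
        public int dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public int nFileSizeHigh;
        public int nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string fileName, out WIN32_FIND_DATA findData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA findData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    // Walks the tree holding one find handle per directory on the pending stack;
    // no arrays of file names are ever materialized.
    public static IEnumerable<string> EnumerateFiles(string root, string extension)
    {
        Stack<string> pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count != 0)
        {
            string dir = pending.Pop();
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out data);
            if (handle == INVALID_HANDLE_VALUE) continue; // e.g. access denied
            try
            {
                do
                {
                    if (data.cFileName == "." || data.cFileName == "..") continue;
                    string full = Path.Combine(dir, data.cFileName);
                    if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                        pending.Push(full);
                    else if (full.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
                        yield return full;
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle);
            }
        }
    }
}

You would then iterate NativeFileEnumerator.EnumerateFiles(drive, ".ext") once per indexed extension (or widen the filter to check all 10 extensions in one pass), calling ProcessFile on each result.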