C# - How to list files in subdirectories quickly and efficiently
I am trying to list the files in all the sub-directories of a root directory with the approach below, but it takes a very long time when the number of files runs into the millions. Is there a better way to do this?
I am using .NET 3.5, so I can't use the enumerator methods :-(
******************* Main *************
DirectoryInfo dir = new DirectoryInfo(path);
DirectoryInfo[] subDir = dir.GetDirectories();
foreach (DirectoryInfo di in subDir) // call for each sub-directory
{
    PopulateList(di.FullName, false);
}
*******************************************
static void PopulateList(string directory, bool IsRoot)
{
    // Run "cmd /c dir /s /b <directory>" and capture its output
    System.Diagnostics.ProcessStartInfo procStartInfo =
        new System.Diagnostics.ProcessStartInfo("cmd", "/c " + "dir /s/b \"" + directory + "\"");
    procStartInfo.RedirectStandardOutput = true;
    procStartInfo.UseShellExecute = false;
    procStartInfo.CreateNoWindow = true;

    System.Diagnostics.Process proc = new System.Diagnostics.Process();
    proc.StartInfo = procStartInfo;
    proc.Start();

    // Write each line of the dir output to "<directory name>.lst"
    string fileName = directory.Substring(directory.LastIndexOf('\\') + 1);
    StreamWriter writer = new StreamWriter(fileName + ".lst");
    while (proc.StandardOutput.EndOfStream != true)
    {
        writer.WriteLine(proc.StandardOutput.ReadLine());
        writer.Flush();
    }
    writer.Close();
}
Comments (5)
Remove all the Process-related stuff and try out the Directory.GetDirectories() and Directory.GetFiles() methods.
From MSDN, SearchOption.AllDirectories includes the current directory and all of its subdirectories in a search operation.
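A minimal sketch of that approach, assuming the root path is in a variable named path and keeping the same per-sub-directory .lst output as the question (the file naming here is just an assumption):

using System.IO;

static void PopulateAll(string path)
{
    foreach (string subDir in Directory.GetDirectories(path))
    {
        // A single call walks the whole subtree; no external cmd/dir process needed
        string[] files = Directory.GetFiles(subDir, "*", SearchOption.AllDirectories);
        File.WriteAllLines(Path.GetFileName(subDir) + ".lst", files);
    }
}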
It will definitely be faster to use DirectoryInfo.GetFiles in a loop for each directory instead of spawning tons of new processes and reading their output.
With millions of files you're actually running into filesystem limitations (see this and search for "300,000"), so take that into account.
As for optimization, I think you'd really want to iterate lazily, so you'll have to P/Invoke into FindFirstFile/FindNextFile.
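A rough sketch of that P/Invoke route (untested; written against the documented Win32 FindFirstFile/FindNextFile signatures, with a helper name of my own). Since it is a yield-based iterator it streams paths lazily, which works on .NET 3.5:

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class NativeFileEnumerator
{
    // Minimal WIN32_FIND_DATA layout; CharSet.Auto binds to the wide FindFirstFile/FindNextFile
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Lazily yields the full path of every file under 'directory', recursing into sub-directories
    public static IEnumerable<string> EnumerateFiles(string directory)
    {
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;

        try
        {
            do
            {
                if (findData.cFileName == "." || findData.cFileName == "..")
                    continue;

                string fullPath = Path.Combine(directory, findData.cFileName);
                if ((findData.dwFileAttributes & FileAttributes.Directory) != 0)
                {
                    foreach (string nested in EnumerateFiles(fullPath))
                        yield return nested;
                }
                else
                {
                    yield return fullPath;
                }
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
    }
}

Each path can then be written to the .lst file as soon as it is returned, instead of building up a huge in-memory array first.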
Check out the already available Directory.GetFiles overload.
For example:
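A minimal sketch of that call, assuming the root path is held in a variable named root (this overload is available in .NET 3.5):

// One call returns every file path under root, including all sub-directories
string[] files = Directory.GetFiles(root, "*.*", SearchOption.AllDirectories);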
And yes, it will take a lot of time. But I don't think you can increase its performance using only .NET classes.
Assuming that your millions of files are spread across multiple sub-directories and you're using .NET 4.0, you could look at the parallel extensions.
Using a parallel foreach loop to process the list of sub-directories could make things a lot faster.
The new parallel extensions are also a lot safer and easier to use than attempting multi-threading at a lower-level.
The one thing to look out for is making sure that you limit the number of concurrent processes to something sensible.
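A sketch of that idea, assuming .NET 4.0 and a root path variable named root; MaxDegreeOfParallelism is used to keep the concurrency at a sensible level, as suggested above:

using System.IO;
using System.Threading.Tasks;

class ParallelLister
{
    static void ListAll(string root)
    {
        // Cap the number of concurrent workers to something sensible for the disk
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(Directory.GetDirectories(root), options, subDir =>
        {
            // Each sub-directory is listed independently, mirroring the original PopulateList
            string[] files = Directory.GetFiles(subDir, "*", SearchOption.AllDirectories);
            File.WriteAllLines(Path.GetFileName(subDir) + ".lst", files);
        });
    }
}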