The C routines opendir(), readdir() and closedir() give me a way to traverse a directory structure. However, the dirent structure returned by readdir() does not seem to offer a useful way to obtain the DIR pointers I would need to recurse into the subdirectories.
Of course, it gives me the name of each file, so I could either append that name to the directory path and then stat() and opendir() it, or change the current working directory of the process via chdir() and roll it back via chdir("..").
The problem with the first approach is that if the directory path is long enough, the cost of passing a string containing it to opendir() will outweigh the cost of opening the directory. To put it more theoretically, the complexity could grow beyond linear time (in the total character count of the (relative) filenames in the directory tree).
The second approach has a problem too. Since each process has a single current working directory, all but one thread would have to block in a multithreaded application. Also, I don't know whether the current working directory is a mere convenience (i.e., whether a relative path is simply appended to it before each filesystem query). If it is, this approach would be inefficient too.
I am open to alternatives to these functions. So how can one traverse a UNIX directory tree efficiently (in time linear in the total character count of the filenames under it)?
Have you tried ftw(), aka File Tree Walk? Snippet from man 3 ftw:
int ftw(const char *dir, int (*fn)(const char *file, const struct stat *sb, int flag), int nopenfd);
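To make the prototype concrete, here is a minimal sketch of how ftw() might be used to count the files and directories under a root, assuming a POSIX system. The names count_entry and count_tree are hypothetical and chosen for illustration; they are not from the man page. Since the callback receives no user pointer, the tallies are kept in file-scope variables.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* ftw() passes no user data to the callback, so the tallies live in
   file-scope variables (which makes this sketch non-reentrant). */
static long n_files;
static long n_dirs;

/* Hypothetical callback: ftw() calls it once per object in the tree. */
static int count_entry(const char *path, const struct stat *sb, int flag)
{
    (void)sb; /* the stat data is available here but unused in this sketch */
    if (flag == FTW_D) {
        n_dirs++;
        printf("%s/\n", path);
    } else if (flag == FTW_F) {
        n_files++;
        printf("%s\n", path);
    }
    return 0; /* a nonzero return would stop the walk */
}

/* Hypothetical helper: walk `root` and report the tallies.
   Returns 0 on success, -1 if the walk failed. */
int count_tree(const char *root, long *files, long *dirs)
{
    assert(root != NULL && files != NULL && dirs != NULL);
    n_files = 0;
    n_dirs = 0;
    if (ftw(root, count_entry, 16) != 0) /* hold at most 16 directories open */
        return -1;
    *files = n_files;
    *dirs = n_dirs;
    return 0;
}
```

Calling count_tree(".", &files, &dirs) would print every path under the current directory and fill in the two counters; the directory count includes the root itself.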
You seem to be missing one basic point: directory traversal involves reading data from the disk. Even when that data is in the cache, you end up going through a fair amount of code to get it from the cache into your process. Paths are also generally pretty short; anything longer than a couple hundred bytes is unusual. Together these mean that you can reasonably build up strings for all the paths you need without any real problem. The time spent building the strings is still minor compared to the time spent reading data from the disk. That means you can normally ignore the time spent on string manipulation and concentrate exclusively on optimizing disk usage.
My own experience has been that for most directory traversal a breadth-first search is usually preferable: as you're traversing the current directory, put the full paths to all sub-directories in something like a queue (a plain FIFO is enough to get breadth-first order). When you're finished traversing the current directory, pull the first item from the queue and traverse it, continuing until the queue is empty. This generally improves cache locality, so it reduces the amount of time spent reading the disk. Depending on the system (disk speed vs. CPU speed, total memory available, etc.) it's nearly always at least as fast as a depth-first traversal, and can easily be up to twice as fast (or so).
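The breadth-first scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: a plain FIFO linked list stands in for the queue, the names node, queue, push, pop, join and bfs_count_files are hypothetical, and lstat() is used to tell directories from other entries so that symbolic links are not followed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* FIFO queue of heap-allocated directory paths (hypothetical helper types). */
struct node { char *path; struct node *next; };
struct queue { struct node *head, *tail; };

/* Append `path`; on success the queue owns the string. */
static int push(struct queue *q, char *path)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->path = path;
    n->next = NULL;
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    return 0;
}

/* Remove and return the oldest path, or NULL if the queue is empty. */
static char *pop(struct queue *q)
{
    struct node *n = q->head;
    char *path;
    if (n == NULL)
        return NULL;
    path = n->path;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;
    free(n);
    return path;
}

/* Build "dir/name" in fresh memory; returns NULL if malloc fails. */
static char *join(const char *dir, const char *name)
{
    size_t len = strlen(dir) + strlen(name) + 2; /* '/' and '\0' */
    char *p = malloc(len);
    if (p != NULL)
        snprintf(p, len, "%s/%s", dir, name);
    return p;
}

/* Visit the tree under `root` level by level. Returns the number of
   non-directory entries seen, or -1 if memory ran out. Directories that
   cannot be opened are skipped. */
long bfs_count_files(const char *root)
{
    struct queue q = { NULL, NULL };
    long files = 0;
    int failed = 0;
    char *dir;

    assert(root != NULL);
    dir = strdup(root);
    if (dir == NULL)
        return -1;
    while (dir != NULL) {
        DIR *d = failed ? NULL : opendir(dir); /* after a failure, only drain */
        struct dirent *e;
        while (d != NULL && (e = readdir(d)) != NULL) {
            struct stat sb;
            char *child;
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            child = join(dir, e->d_name);
            if (child == NULL) { failed = 1; break; }
            if (lstat(child, &sb) != 0) {       /* vanished or unreadable */
                free(child);
                continue;
            }
            if (!S_ISDIR(sb.st_mode)) {
                files++;
                free(child);
                continue;
            }
            if (push(&q, child) != 0) {         /* out of memory */
                free(child);
                failed = 1;
                break;
            }
        }
        if (d != NULL)
            closedir(d);
        free(dir);
        dir = pop(&q); /* next directory in breadth-first order */
    }
    return failed ? -1 : files;
}
```

Only one directory stream is open at a time here, so the traversal depth is not bounded by the file descriptor limit; the price is that every queued entry carries its full path.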
The way to use opendir/readdir/closedir is to make the function recursive! Have a look at the snippet on Dreamincode.net. Hope this helps.
EDIT: Thanks R.Sahu, the link has expired; however, I found it via the Wayback archive and took the liberty of adding it to a gist. Please remember to check the license accordingly and attribute the original author of the source! :)
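For readers without access to the linked snippet, a recursive opendir/readdir/closedir walk might look roughly like the sketch below. It is not the Dreamincode code; the function name walk, the indentation scheme and the use of lstat() to detect directories are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: print every entry below `path`, indented by one
   space per level, and return the number of entries printed. */
long walk(const char *path, int depth)
{
    long count = 0;
    struct dirent *e;
    DIR *d;

    assert(path != NULL && depth >= 0);
    d = opendir(path);
    if (d == NULL) {
        perror(path);
        return 0;
    }
    while ((e = readdir(d)) != NULL) {
        struct stat sb;
        size_t len;
        char *child;

        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        /* Build "path/name"; this string grows with the depth of the tree. */
        len = strlen(path) + strlen(e->d_name) + 2;
        child = malloc(len);
        if (child == NULL)
            break;
        snprintf(child, len, "%s/%s", path, e->d_name);
        if (lstat(child, &sb) == 0) { /* lstat: do not follow symlinks */
            int is_dir = S_ISDIR(sb.st_mode);
            printf("%*s%s%s\n", depth, "", e->d_name, is_dir ? "/" : "");
            count++;
            if (is_dir)
                count += walk(child, depth + 1); /* recurse into the subdirectory */
        }
        free(child);
    }
    closedir(d);
    return count;
}
```

A call such as walk(".", 0) prints the tree under the current directory. Each level of recursion keeps one directory stream open, so very deep trees can run into the per-process file descriptor limit.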
Instead of opendir(), you can use a combination of openat(), dirfd() and fdopendir() and construct a recursive function to walk a directory tree.
readdir() is still used to get the next directory entry. If the next entry is a directory, we find the parent directory's fd with dirfd() and pass it, along with the child directory's name, to openat(). The resulting fd refers to the child directory. It is passed to fdopendir(), which returns a DIR * pointer for the child directory; that pointer can then be passed to our dir_recurse(), where it is again valid for use with readdir() calls.
The program recurses over the whole directory tree rooted at "..". Entries are printed, indented by 1 space per directory level. Directories are printed with a trailing "/".
On Compiler Explorer.
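A minimal sketch of the dir_recurse() function described above could look like this. It is an illustration under assumptions rather than the answer's own listing: the signature, the return value (a count of printed entries) and the use of fstatat() with AT_SYMLINK_NOFOLLOW to detect directories are choices made here, and a POSIX.1-2008 system is assumed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recurse over an already-open directory stream. Prints each entry,
   indented by one space per level, and returns how many were printed.
   The caller keeps ownership of `parent` and must closedir() it. */
long dir_recurse(DIR *parent, int level)
{
    long count = 0;
    struct dirent *ent;

    assert(parent != NULL && level >= 0);
    while ((ent = readdir(parent)) != NULL) {
        struct stat sb;
        DIR *child;
        int fd;

        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        /* Look the name up relative to the parent fd; no path string is built. */
        if (fstatat(dirfd(parent), ent->d_name, &sb, AT_SYMLINK_NOFOLLOW) != 0)
            continue;
        count++;
        if (!S_ISDIR(sb.st_mode)) {
            printf("%*s%s\n", level, "", ent->d_name);
            continue;
        }
        printf("%*s%s/\n", level, "", ent->d_name);

        /* Open the child directory relative to the parent directory's fd. */
        fd = openat(dirfd(parent), ent->d_name, O_RDONLY | O_DIRECTORY);
        if (fd == -1)
            continue; /* e.g. permission denied: skip this subtree */
        child = fdopendir(fd);
        if (child == NULL) {
            close(fd);
            continue;
        }
        count += dir_recurse(child, level + 1);
        closedir(child); /* also closes fd, which the stream now owns */
    }
    return count;
}
```

The caller opens the root once with opendir(), checks the result for NULL, passes it to dir_recurse() with level 0, and closes it afterwards. Because every lookup is relative to a directory fd, the work per entry does not depend on how long the full path would be, which is the cost the question is concerned about; the trade-off is one open fd per level of recursion.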
Probably overkill for your application, but here's a library designed to traverse a directory tree with hundreds of millions of files.
https://github.com/hpc/libcircle