What is the fastest / easiest way to count a large number of files in a directory (in Linux)?
I had a directory with a large number of files. Every time I tried to access the list of files within it, I was not able to, or there was a significant delay. I was trying to use the ls command from the command line on Linux, and the web interface from my hosting provider did not help either.

The problem is that when I just do ls, it takes a significant amount of time to even start displaying anything. Thus, ls | wc -l would not help either.
After some research I came up with this code (in this example it counts the number of new emails on some server):

from os import walk

# Sum the file counts of every directory under Maildir/new
print(sum(len(files) for (root, dirs, files) in walk('/home/myname/Maildir/new')))
The above code is written in Python. I used Python's command-line interpreter and it worked pretty fast (it returned the result instantly).
I am interested in the answer to the following question: is it possible to count files in a directory (without subdirectories) faster? What is the fastest way to do that?
8 Answers
ls does a stat(2) call for every file. Other tools, like find(1) and the shell wildcard expansion, may avoid this call and just do readdir. One shell command combination that might work is find dir -maxdepth 1 | wc -l, but it will gladly list the directory itself and miscount any filename with a newline in it.

From Python, the straightforward way to get just these names is os.listdir(directory). Unlike os.walk and os.path.walk, it does not need to recurse, check file types, or make further Python function calls.
Addendum: It seems ls doesn't always stat. At least on my GNU system, it can do only a getdents call when further information (such as which names are directories) is not requested. getdents is the underlying system call used to implement readdir in GNU/Linux.
Addition 2: One reason for a delay before ls outputs results is that it sorts and tabulates. ls -U1 may avoid this.
This should be pretty fast in Python:
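Along the lines of the following, counting the entries returned by os.listdir (the path is just a placeholder):

import os

# Non-recursive: counts the entries of a single directory.
print(len(os.listdir('/home/myname/Maildir/new')))   # placeholder path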
find can give you either count: the total number of files in the given directory, or the total number of files in the given directory and all subdirectories under it.

For more details, drop into a terminal and do man find.
I'm not sure about speed, but if you want to just use shell builtins, counting the results of a * glob expansion should work.
I think ls is spending most of its time before displaying the first line because it has to sort the entries, so ls -U should display the first line much faster (though it may not be that much better in total).
The fastest way would be to avoid all the overhead of interpreted languages and write some code that directly addresses your problem. Doing so is difficult to do in a portable way, but otherwise pretty straightforward. At the moment I'm on an OS X box, but converting such a program to Linux should be extremely straightforward. (I opted to ignore hidden files and only count regular files... modify as necessary or add command-line switches to get the functionality you want.)
My use case is a Linux SBC (a Banana Pi) counting files in a directory on a FAT32 USB stick.

In a shell, the count takes 6.4 secs with 32k files in there (32k = max files/dir on FAT32).

From Python it takes only 0.874 secs(!). I can't see anything else in Python being quicker than that.
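A sketch of that kind of comparison (the mount point /mnt/usb is a placeholder, and ls | wc -l / os.listdir are assumed as the counting methods):

import os
import subprocess
import time

directory = '/mnt/usb'   # assumed mount point of the FAT32 stick

# Shell-side count for comparison (spawns ls and wc in the target directory)
start = time.time()
out = subprocess.run('ls | wc -l', shell=True, cwd=directory,
                     capture_output=True, text=True)
print('shell : %s files in %.3f s' % (out.stdout.strip(), time.time() - start))

# Python-side count: a single os.listdir pass
start = time.time()
count = len(os.listdir(directory))
print('python: %d files in %.3f s' % (count, time.time() - start))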
A shorter way of counting files in a directory in bash:
files=(*) ; echo ${#files[@]}
I generated 10_000 empty files in tmpfs; counting them this way takes 0.03 s on my machine, and running ls | wc -l was just slightly slower (I flushed the cache before and in between, just in case).
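If you want to reproduce that benchmark, here is a small Python sketch that creates the empty test files (the tmpfs location /dev/shm/count_test is an assumed path, not from the original answer):

import os

directory = '/dev/shm/count_test'   # assumed tmpfs location; adjust as needed
os.makedirs(directory, exist_ok=True)

# Create 10_000 empty files, matching the benchmark described above.
for i in range(10_000):
    open(os.path.join(directory, 'file_%05d' % i), 'w').close()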