A reasonably fast way to traverse a directory tree in Python?
Assuming that the given directory tree is of reasonable size: say an open source project like Twisted or Python, what is the fastest way to traverse and iterate over the absolute path of all files/directories inside that directory?
I want to do this from within Python. os.path.walk is slow. So I tried ls -lR and tree -fi. For a project with about 8337 files (including tmp, pyc, test, .svn files):
$ time tree -fi > /dev/null
real 0m0.170s
user 0m0.044s
sys 0m0.123s
$ time ls -lR > /dev/null
real 0m0.292s
user 0m0.138s
sys 0m0.152s
$ time find . > /dev/null
real 0m0.074s
user 0m0.017s
sys 0m0.056s
$
tree appears to be faster than ls -lR (though ls -R is faster than tree, it does not give full paths). find is the fastest.
Can anyone think of a faster and/or better approach? On Windows, I may simply ship a 32-bit binary tree.exe or ls.exe if necessary.
Update 1: Added find
Update 2: Why do I want to do this? ... I am trying to make a smart replacement for cd, pushd, etc., and wrapper commands for other commands relying on passing paths (less, more, cat, vim, tail). The program will use file traversal occasionally to do that (e.g.: typing "cd sr grai pat lxml" would automatically translate to "cd src/pypm/grail/patches/lxml"). I won't be satisfied if this cd replacement takes, say, half a second to run. See http://github.com/srid/pf
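For concreteness, the pure-Python baseline being benchmarked here can be sketched with os.walk as follows (a minimal version of the idea, not the questioner's actual code; the function name is mine):

```python
import os

def iter_paths(root):
    """Yield the absolute path of every file and directory under root."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)

# Example: collect all paths under the current directory.
paths = list(iter_paths("."))
```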
Comments (4)
Your approach in pf is going to be hopelessly slow, even if os.path.walk took no time at all. Doing a regex match containing 3 unbounded closures across all extant paths will kill you right there. Here is the code from Kernighan and Pike that I referenced; it is a proper algorithm for the task.

Note: this code was written well before ANSI C, ISO C, or POSIX was even imagined, back when one read directory files raw. The approach of the code is far more useful than all the pointer slinging.
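The K&P listing is not reproduced here, but the answer's point about the regex can be illustrated independently: instead of one pattern with several unbounded closures matched against every path, each typed fragment can be matched as a prefix of successive path components, in order. A sketch of that idea (the function name and exact matching rule are my own, not from the answer or from pf):

```python
def matches(abbrev, path):
    """True if the space-separated fragments of `abbrev` appear, in order,
    as prefixes of a subsequence of the path's components, so that e.g.
    'sr grai pat lxml' matches 'src/pypm/grail/patches/lxml'."""
    components = iter(path.split("/"))  # assumes '/' as the separator
    # Each any() consumes the shared iterator, so later fragments can only
    # match later components: a greedy, linear-time subsequence match.
    return all(
        any(c.startswith(frag) for c in components)
        for frag in abbrev.split()
    )
```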
It would be hard to get much better than find in performance, but the question is: how much faster, and why do you need it to be so fast? You claim that os.path.walk is slow; indeed, it is ~3 times slower on my machine over a tree of 16k directories. But then again, we're talking about the difference between 0.68 seconds and 1.9 seconds for Python.

If setting a speed record is your goal, you can't beat hard-coded C that is ~75% system-call bound, and you can't make the OS go faster. That said, about 25% of the Python time is spent in system calls. What is it that you want to do with the traversed paths?
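The kind of comparison described here can be reproduced with a small timing harness like the one below (a sketch assuming a Unix `find` on PATH; absolute numbers will vary with machine and tree size):

```python
import os
import subprocess
import time

def time_python_walk(root):
    """Count entries under root with os.walk; return (count, seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
    return count, time.perf_counter() - start

def time_find(root):
    """Count entries under root by running find(1); return (count, seconds)."""
    start = time.perf_counter()
    out = subprocess.run(["find", root], capture_output=True, check=True)
    count = len(out.stdout.splitlines()) - 1  # find also prints root itself
    return count, time.perf_counter() - start
```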
One solution you have not mentioned is 'os.walk'. I'm not sure it'd be any faster than os.path.walk, but it's objectively better.
You have not said what you're going to do with the list of directories when you have it, so it's hard to give more specific suggestions.
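One concrete advantage of os.walk, worth noting given that the questioner's tree includes tmp, pyc and .svn entries: it lets you prune directories in place so they are never descended into at all. A minimal sketch (the function name and default prune set are mine):

```python
import os

def iter_file_paths(root, skip=frozenset({".svn", "tmp"})):
    """Yield file paths under root, never descending into pruned dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] (in place) tells os.walk not to
        # descend into the removed directories at all.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            yield os.path.join(dirpath, name)
```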
Although I doubt you have multiple read heads, here's how you can traverse a few million files (we've done 10M+ in a few minutes).
https://github.com/hpc/purger/blob/master/src/treewalk/treewalk.c
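The linked treewalk is C with a distributed work queue; a far smaller single-machine analogue of the same idea, walking each top-level subdirectory concurrently, might look like this in Python (a sketch, not the linked code; thread count and structure are arbitrary choices of mine):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def walk_subtree(path):
    """Collect every path under one subtree sequentially."""
    found = []
    for dirpath, dirnames, filenames in os.walk(path):
        for name in dirnames + filenames:
            found.append(os.path.join(dirpath, name))
    return found

def parallel_walk(root, workers=4):
    """Walk each top-level subdirectory of root in its own thread."""
    with os.scandir(root) as it:
        entries = list(it)
    top_dirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    results = [e.path for e in entries]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for sub in pool.map(walk_subtree, top_dirs):
            results.extend(sub)
    return results
```

Whether this beats a sequential os.walk depends on the filesystem: threads mostly help when directory metadata reads can overlap, e.g. on network or multi-spindle storage, which is the "multiple read heads" caveat above.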