在 Python 中遍历目录树的相当快的方法?

发布于 2024-08-30 10:17:04 字数 1101 浏览 6 评论 0原文

假设给定的目录树具有合理的大小:比如说像 Twisted 或 Python 这样的开源项目,遍历和迭代该目录内所有文件/目录的绝对路径的最快方法是什么?

我想在 Python 中执行此操作。 os.path.walk 很慢。所以我尝试了ls -lRtree -fi。对于包含大约 8337 个文件(包括 tmp、pyc、test、.svn 文件)的项目:

$ time tree -fi > /dev/null 

real    0m0.170s
user    0m0.044s
sys     0m0.123s

$ time ls -lR > /dev/null 

real    0m0.292s
user    0m0.138s
sys     0m0.152s

$ time find . > /dev/null 

real    0m0.074s
user    0m0.017s
sys     0m0.056s
$

tree 似乎比 ls -lR 更快(尽管 ls - Rtree 更快,但它不提供完整路径)。 find 是最快的。

有人能想到更快和/或更好的方法吗?在 Windows 上,如果需要,我可以简单地提供 32 位二叉树.exe 或 ls.exe。

更新 1:添加了 find

更新 2:为什么我要这样做? ...我正在尝试对 cd、pushd 等进行智能替换,并为其他依赖于传递路径的命令(less、more、cat、vim、tail)进行包装命令。程序偶尔会使用文件遍历来执行此操作(例如:输入“cd sr grai pat lxml”将自动转换为“cd src/pypm/grail/patches/lxml”)。如果这个 CD 替换需要半秒才能运行,我不会满意。请参阅http://github.com/srid/pf

Assuming that the given directory tree is of reasonable size: say an open source project like Twisted or Python, what is the fastest way to traverse and iterate over the absolute path of all files/directories inside that directory?

I want to do this from within Python. os.path.walk is slow. So I tried ls -lR and tree -fi. For a project with about 8337 files (including tmp, pyc, test, .svn files):

$ time tree -fi > /dev/null 

real    0m0.170s
user    0m0.044s
sys     0m0.123s

$ time ls -lR > /dev/null 

real    0m0.292s
user    0m0.138s
sys     0m0.152s

$ time find . > /dev/null 

real    0m0.074s
user    0m0.017s
sys     0m0.056s
$

tree appears to be faster than ls -lR (though ls -R is faster than tree, but it does not give full paths). find is the fastest.

Can anyone think of a faster and/or better approach? On Windows, I may simply ship a 32-bit binary tree.exe or ls.exe if necessary.

Update 1: Added find

Update 2: Why do I want to do this? ... I am trying to make a smart replacement for cd, pushd, etc.. and wrapper commands for other commands relying on passing paths (less, more, cat, vim, tail). The program will use file traversal occasionally to do that (eg: typing "cd sr grai pat lxml" would automatically translate to "cd src/pypm/grail/patches/lxml"). I won't be satisfied if this cd replacement took, say, half a second to run. See http://github.com/srid/pf

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

深空失忆 2024-09-06 10:17:05

即使 os.path.walk 根本不花时间,你在 pf 中的方法也会慢得令人绝望。在所有现有路径上进行包含 3 个无界闭包的正则表达式匹配将会杀死你。这是我引用的 Kernighan 和 Pike 的代码,这是一个任务的正确算法:

/* spname:  return correctly spelled filename */
/*
 * spname(oldname, newname)  char *oldname, *newname;
 *  returns -1 if no reasonable match to oldname,
 *           0 if exact match,
 *           1 if corrected.
 *  stores corrected name in newname.
 */

#include <sys/types.h>
#include <sys/dir.h>

spname(oldname, newname)
    char *oldname, *newname;
{
    char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
    char *new = newname, *old = oldname;

    for (;;) {
        while (*old == '/') /* skip slashes */
            *new++ = *old++;
        *new = '\0';
        if (*old == '\0')   /* exact or corrected */
            return strcmp(oldname,newname) != 0;
        p = guess;  /* copy next component into guess */
        for ( ; *old != '/' && *old != '\0'; old++)
            if (p < guess+DIRSIZ)
                *p++ = *old;
        *p = '\0';
        if (mindist(newname, guess, best) >= 3)
            return -1;  /* hopeless */
        for (p = best; *new = *p++; ) /* add to end */
            new++;                    /* of newname */
    }
}

mindist(dir, guess, best)   /* search dir for guess */
    char *dir, *guess, *best;
{
    /* set best, return distance 0..3 */
    int d, nd, fd;
    struct {
        ino_t ino;
        char  name[DIRSIZ+1];   /* 1 more than in dir.h */
    } nbuf;

    nbuf.name[DIRSIZ] = '\0';   /* +1 for terminal '\0' */
    if (dir[0] == '\0')     /* current directory */
        dir = ".";
    d = 3;  /* minimum distance */
    if ((fd=open(dir, 0)) == -1)
        return d;
    while (read(fd,(char *) &nbuf,sizeof(struct direct)) > 0)
        if (nbuf.ino) {
            nd = spdist(nbuf.name, guess);
            if (nd <= d && nd != 3) {
                strcpy(best, nbuf.name);
                d = nd;
                if (d == 0)     /* exact match */
                    break;
            }
        }
    close(fd);
    return d;
}

/* spdist:  return distance between two names */
/*
 *  very rough spelling metric:
 *  0 if the strings are identical
 *  1 if two chars are transposed
 *  2 if one char wrong, added or deleted
 *  3 otherwise
 */

#define EQ(s,t) (strcmp(s,t) == 0)

spdist(s, t)
    char *s, *t;
{
    while (*s++ == *t)
        if (*t++ == '\0')
            return 0;       /* exact match */
    if (*--s) {
        if (*t) {
            if (s[1] && t[1] && *s == t[1] 
              && *t == s[1] && EQ(s+2, t+2))
                return 1;   /* transposition */
            if (EQ(s+1, t+1))
                return 2;   /* 1 char mismatch */
        }
        if (EQ(s+1, t))
            return 2;       /* extra character */
    }
    if (*t && EQ(s, t+1))
        return 2;           /* missing character */
    return 3;
}

注意:此代码是在 ANSI C、ISO C 或 POSIX 之前编写的,当人们读取原始目录文件时甚至无法想象任何内容。代码的方法比所有的指针吊索有用得多。

Your approach in pf is going to be hopelessly slow, even if os.path.walk took no time at all. Doing a regex match containing 3 unbounded closures across all extant paths will kill you right there. Here is the code from Kernighan and Pike that I referenced, this is a proper algorithm for the task:

/* spname:  return correctly spelled filename */
/*
 * spname(oldname, newname)  char *oldname, *newname;
 *  returns -1 if no reasonable match to oldname,
 *           0 if exact match,
 *           1 if corrected.
 *  stores corrected name in newname.
 */

#include <sys/types.h>
#include <sys/dir.h>

spname(oldname, newname)
    char *oldname, *newname;
{
    char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
    char *new = newname, *old = oldname;

    for (;;) {
        while (*old == '/') /* skip slashes */
            *new++ = *old++;
        *new = '\0';
        if (*old == '\0')   /* exact or corrected */
            return strcmp(oldname,newname) != 0;
        p = guess;  /* copy next component into guess */
        for ( ; *old != '/' && *old != '\0'; old++)
            if (p < guess+DIRSIZ)
                *p++ = *old;
        *p = '\0';
        if (mindist(newname, guess, best) >= 3)
            return -1;  /* hopeless */
        for (p = best; *new = *p++; ) /* add to end */
            new++;                    /* of newname */
    }
}

mindist(dir, guess, best)   /* search dir for guess */
    char *dir, *guess, *best;
{
    /* set best, return distance 0..3 */
    int d, nd, fd;
    struct {
        ino_t ino;
        char  name[DIRSIZ+1];   /* 1 more than in dir.h */
    } nbuf;

    nbuf.name[DIRSIZ] = '\0';   /* +1 for terminal '\0' */
    if (dir[0] == '\0')     /* current directory */
        dir = ".";
    d = 3;  /* minimum distance */
    if ((fd=open(dir, 0)) == -1)
        return d;
    while (read(fd,(char *) &nbuf,sizeof(struct direct)) > 0)
        if (nbuf.ino) {
            nd = spdist(nbuf.name, guess);
            if (nd <= d && nd != 3) {
                strcpy(best, nbuf.name);
                d = nd;
                if (d == 0)     /* exact match */
                    break;
            }
        }
    close(fd);
    return d;
}

/* spdist:  return distance between two names */
/*
 *  very rough spelling metric:
 *  0 if the strings are identical
 *  1 if two chars are transposed
 *  2 if one char wrong, added or deleted
 *  3 otherwise
 */

#define EQ(s,t) (strcmp(s,t) == 0)

spdist(s, t)
    char *s, *t;
{
    while (*s++ == *t)
        if (*t++ == '\0')
            return 0;       /* exact match */
    if (*--s) {
        if (*t) {
            if (s[1] && t[1] && *s == t[1] 
              && *t == s[1] && EQ(s+2, t+2))
                return 1;   /* transposition */
            if (EQ(s+1, t+1))
                return 2;   /* 1 char mismatch */
        }
        if (EQ(s+1, t))
            return 2;       /* extra character */
    }
    if (*t && EQ(s, t+1))
        return 2;           /* missing character */
    return 3;
}

Note: this code was written way before ANSI C, ISO C, or POSIX anything was even imagined when one read directory files raw. The approach of the code is far more useful than all the pointer slinging.

请你别敷衍 2024-09-06 10:17:05

在性能上很难比 find 更好,但问题是快多少以及为什么需要它这么快?您声称 os.path.walk 很慢,事实上,在我的机器上,在 16k 目录树上,它慢了约 3 倍。但话又说回来,我们讨论的是 Python 0.68 秒和 1.9 秒之间的差异。

如果设置速度记录是您的目标,那么您无法击败硬编码的 C,因为它完全受 75% 的系统调用限制,并且您无法使操作系统运行得更快。也就是说,25% 的 Python 时间花在系统调用上。您想对遍历的路径做什么?

It would be hard to get much better than find in performance, but the question is how much faster and why do you need it to be so fast? You claim that os.path.walk is slow, indeed, it is ~3 times slower on my machine over a tree of 16k directories. But then again, we're talking about the difference between 0.68 seconds and 1.9 seconds for Python.

If setting a speed record is your goal, you can't beat hard coded C which is completely 75% system call bound and you can't make the OS go faster. That said, 25% of the Python time is spent in system calls. What is it that you want to do with the traversed paths?

許願樹丅啲祈禱 2024-09-06 10:17:05

您没有提到的一种解决方案是“os.walk”。我不确定它会比 os.path.walk 更快,但客观上它更好。

您还没有说当您拥有目录列表时要如何处理它,因此很难给出更具体的建议。

One solution you have not mentioned is 'os.walk'. I'm not sure it'd be any faster than os.path.walk, but it's objectively better.

You have not said what you're going to do with the list of directories when you have it, so it's hard to give more specific suggestions.

桜花祭 2024-09-06 10:17:05

虽然我怀疑你是否有多个读取头,但以下是如何遍历几百万个文件的方法(我们在几分钟内就完成了 10M+)。

https://github.com/hpc/purger/blob/master /src/treewalk/treewalk.c

Although I doubt you have multiple read heads, here's how you can traverse a few million files (we've done 10M+ in a few minutes).

https://github.com/hpc/purger/blob/master/src/treewalk/treewalk.c

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文