Fast Linux file counting for a large number of files

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When there are that many files, performing ls | wc -l takes quite a long time to execute. I believe this is because it returns the names of all the files. I'm trying to use as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. How can I do it?


迷鸟归林 2024-08-11 23:47:00

By default ls sorts the names, which can take a while if there are a lot of them. Also there will be no output until all of the names are read and sorted. Use the ls -f option to turn off sorting.

ls -f | wc -l

Note: This will also enable -a, so ., .., and other files starting with . will be counted.
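If the two dot entries need to be excluded, a small arithmetic adjustment works. This is a sketch assuming a POSIX shell and GNU coreutils; the scratch directory exists only to make the counts verifiable:

```shell
# Build a scratch directory with a known number of files (100).
tmp=$(mktemp -d)
for i in $(seq 1 100); do : > "$tmp/file$i"; done

raw=$(ls -f "$tmp" | wc -l)   # includes "." and ".." because -f implies -a
adjusted=$((raw - 2))         # drop the two dot entries

echo "$raw $adjusted"
rm -rf "$tmp"
```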

花伊自在美 2024-08-11 23:47:00

The fastest way is a purpose-built program, like this:

#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;

    dir = opendir(argv[1]);

    while((ent = readdir(dir)))
            ++count;

    closedir(dir);

    printf("%s contains %ld files\n", argv[1], count);

    return 0;
}

In my testing, I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):

ls -1  | wc -l    0:01.67
ls -f1 | wc -l    0:00.14
find   | wc -l    0:00.22
dircnt | wc -l    0:00.04

That last one, dircnt, is the program compiled from the above source.

EDIT 2016-09-26

Due to popular demand, I've re-written this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing that in a single loop). Since we have to check file types, differences between different OSs, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdirectory pathnames, but theoretically, the system shouldn't allow any path name longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>

#if defined(WIN32) || defined(_WIN32) 
#define PATH_SEPARATOR '\\' 
#else
#define PATH_SEPARATOR '/' 
#endif

/* A custom structure to hold separate file and directory counts */
struct filecount {
  long dirs;
  long files;
};

/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
    /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif

/* fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);

    /* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }

    while((ent = readdir(dir))) {
      if (strlen(path) + 1 + strlen(ent->d_name) > PATH_MAX) {
          fprintf(stderr, "path too long (%zu) %s%c%s\n", (strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
          return;
      }

/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/* fprintf(stderr, "Using dirent.d_type\n"); */
      if(DT_DIR == ent->d_type) {
#else
/* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
      sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
      if(lstat(subpath, &statbuf)) {
          perror(subpath);
          return;
      }

      if(S_ISDIR(statbuf.st_mode)) {
#endif
          /* Skip "." and ".." directory entries... they are not "real" directories */
          if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
          } else {
              sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
              counts->dirs++;
              count(subpath, counts);
          }
      } else {
          counts->files++;
      }
    }

/* fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;
    count(argv[1], &counts);

    /* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }

    return 0;
}

EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

  1. Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.
  2. If the path of a file is too long, an error message will be emitted and the program will halt.

EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.

The source is available under Apache License 2.0. Patches* welcome!


  • "patch" is what old people like me call a "pull request".
无畏 2024-08-11 23:47:00

Use find. For example:

find . -name "*.ext" | wc -l
听你说爱我 2024-08-11 23:47:00

find, ls, and perl tested against 40,000 files all have about the same speed (though I didn't try to clear the cache):

[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s

[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s

And with Perl's opendir and readdir, the same time:

[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s

Note: I used /bin/ls -f to make sure to bypass any alias (which might slow things down a little) and -f to avoid file sorting.
ls without -f is twice as slow as find/perl;
when ls is used with -f, it seems to take the same time:

[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s

I also would like to have some script to ask the file system directly without all the unnecessary information.

The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.
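The wish above for a script that asks the file system directly can be partly satisfied for subdirectory counts: on traditional Unix file systems (ext4, XFS, tmpfs) a directory's hard-link count is one for its entry in the parent, one for ".", and one for each child's "..", so links minus 2 gives the number of immediate subdirectories without reading the directory at all. A sketch assuming GNU stat; note that this does not work on btrfs, which reports a link count of 1 for directories:

```shell
# Scratch directory with three subdirectories.
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b" "$tmp/c"

# %h is the hard-link count; subtract 2 for "." and the parent's entry.
subdirs=$(( $(stat -c %h "$tmp") - 2 ))

echo "$subdirs"
rm -rf "$tmp"
```

This only counts directories, not files, but it is a single stat call instead of a full directory scan.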

醉生梦死 2024-08-11 23:47:00

Surprisingly for me, a bare-bones find is very much comparable to ls -f

> time ls -f my_dir | wc -l
17626

real    0m0.015s
user    0m0.011s
sys     0m0.009s

versus

> time find my_dir -maxdepth 1 | wc -l
17625

real    0m0.014s
user    0m0.008s
sys     0m0.010s

Of course, the values on the third decimal place shift around a bit every time you execute any of these, so they're basically identical. Notice however that find returns one extra unit, because it counts the actual directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).
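The extra units are easy to verify on a tiny directory (a sketch; the counts in the comments assume an otherwise empty scratch directory):

```shell
tmp=$(mktemp -d)
: > "$tmp/a"
: > "$tmp/b"

lsf=$(ls -f "$tmp" | wc -l)             # 4 lines: a, b, ".", ".."
fnd=$(find "$tmp" -maxdepth 1 | wc -l)  # 3 lines: a, b, and the directory itself

echo "$lsf $fnd"
rm -rf "$tmp"
```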

少女七分熟 2024-08-11 23:47:00

Fast Linux file count

The fastest Linux file count I know is

locate -c -r '/home'

There is no need to invoke grep! But as mentioned, you should have a fresh database (updated daily by a cron job, or manually via sudo updatedb).

From man locate

-c, --count
    Instead  of  writing  file  names on standard output, write the number of matching
    entries only.

Additionally, you should know that it also counts directories as files!


BTW: If you want an overview of the files and directories on your system, type

locate -S

It outputs the number of directories, files, etc.

木森分化 2024-08-11 23:47:00

You can change the output based on your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.

dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }

This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command can narrow down which kinds of files get counted, etc.

It results in something like this:

1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
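A variant of the same one-liner that avoids parsing ls output (which can be fragile with unusual file names) globs the subdirectories directly. This is only a sketch: it trades the numeric sort for glob order, and the scratch directory with numbered subdirectories exists purely for illustration:

```shell
# Scratch layout: two numbered directories with 2 and 1 files respectively.
dir=$(mktemp -d)
mkdir "$dir/1" "$dir/2"
: > "$dir/1/a"; : > "$dir/1/b"; : > "$dir/2/c"

result=""
for d in "$dir"/*/; do
    n=$(find "$d" -type f | wc -l)
    result="$result$(basename "$d") => $n, "
done

echo "$result"
rm -rf "$dir"
```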
毁梦 2024-08-11 23:47:00

You can get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took 4.8 seconds on my computer, for tree to count my whole home directory, which was 24,777 directories, 238,680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much slower version of ls.

跨年 2024-08-11 23:47:00

The fastest way on Linux (the question is tagged as Linux), is to use a direct system call. Here's a little program that counts files (only, no directories) in a directory. You can count millions of files and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.

#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>

#define BUF_SIZE 4096

struct linux_dirent {
    long d_ino;
    off_t d_off;
    unsigned short d_reclen;
    char d_name[];
};

int countDir(char *dir) {

    int fd, nread, bpos, numFiles = 0;
    char d_type, buf[BUF_SIZE];
    struct linux_dirent *dirEntry;

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        puts("open directory error");
        exit(3);
    }
    while (1) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) {
            puts("getdents error");
            exit(1);
        }
        if (nread == 0) {
            break;
        }

        for (bpos = 0; bpos < nread;) {
            dirEntry = (struct linux_dirent *) (buf + bpos);
            d_type = *(buf + bpos + dirEntry->d_reclen - 1);
            if (d_type == DT_REG) {
                // Increase counter
                numFiles++;
            }
            bpos += dirEntry->d_reclen;
        }
    }
    close(fd);

    return numFiles;
}

int main(int argc, char **argv) {

    if (argc != 2) {
        puts("Pass directory as parameter");
        return 2;
    }
    printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
    return 0;
}

PS: It is not recursive, but you could modify it to achieve that.

刘备忘录 2024-08-11 23:47:00

ls spends more time sorting the file names. Use -f to disable the sorting, which will save some time:

ls -f | wc -l

Or you can use find:

find . -type f | wc -l
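Note that the two commands above don't count the same thing: ls -f counts one directory level (plus the dot entries), while find descends into subdirectories. A sketch of constraining find to a single level with -maxdepth:

```shell
tmp=$(mktemp -d)
mkdir "$tmp/sub"
: > "$tmp/a"; : > "$tmp/sub/b"

recursive=$(find "$tmp" -type f | wc -l)           # 2: a and sub/b
single=$(find "$tmp" -maxdepth 1 -type f | wc -l)  # 1: only a

echo "$recursive $single"
rm -rf "$tmp"
```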
ら栖息 2024-08-11 23:47:00

I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.

I took the liberty to extend the approach by Christopher Schultz so it supports passing directories via arguments (his recursive approach uses stat as well).

Put the following into file dircnt_args.c:

#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;

    for(i=1; i < argc; i++) {
        dir = opendir(argv[i]);
        count = 0;
        while((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }
    printf("sum: %ld\n", countsum);

    return 0;
}

After a gcc -o dircnt_args dircnt_args.c you can invoke it like this:

dircnt_args /your/directory/*

On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run, approximately 23 seconds on subsequent cached runs).

The only other approach that finished in less than an hour was ls with about 1 min on cache: ls -f /your/directory/* | wc -l. The count is off by a couple of newlines per directory though...

Contrary to my expectations, none of my attempts with find returned within an hour :-/

如果没结果 2024-08-11 23:47:00

You should use "getdents" in place of ls/find

Here is a very good article which describes the getdents approach.

http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

Here is the extract:

ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (e.g., 500 million directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() system call directly, rather than helper methods from the C standard library.

We can find the C code to list the files using getdents() from here:

There are two modifications you will need to make in order to quickly list all the files in a directory.

First, increase the buffer size from X to something like 5 megabytes.

#define BUF_SIZE 1024*1024*5

Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding

if (dp->d_ino != 0) printf(...);

In my case I also really only cared about the file names in the directory so I also rewrote the printf() statement to only print the filename.

if (d->d_ino) printf("%s\n", (char *) d->d_name);

Compile it (it doesn't need any external libraries, so it's super simple to do)

gcc listdir.c -o listdir

Now just run

./listdir [directory with an insane number of files]
原野 2024-08-11 23:47:00

This approach is faster than almost everything else on this page for very large, very nested directories:

https://serverfault.com/a/691372/84703

locate -r '.' | grep -c "^$PWD"

慈悲佛祖 2024-08-11 23:47:00

You could test whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here.

飘然心甜 2024-08-11 23:47:00

I realized that, with a huge amount of data, avoiding in-memory pipeline processing is faster than piping the commands together. So I saved the result to a file and analyzed it afterwards:

ls -1 /path/to/dir > count.txt && wc -l count.txt
东京女 2024-08-11 23:47:00

The first 10 directories with the highest number of files.

dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$(find ${dir}${i} \
    -type f | wc -l) => $i,"; } | sort -nr | head -10
软的没边 2024-08-11 23:47:00

I prefer the following command to keep track of the changes in the number of files in a directory.

watch -d -n 0.01 'ls | wc -l'

The command keeps a window open that tracks the number of files in the directory, refreshing every 0.1 seconds (watch treats intervals shorter than 0.1 seconds as 0.1).
