What is the fastest way to get the size of a directory and its subdirectories on Unix using Perl?



I am using the Perl stat() function to get the size of a directory and its subdirectories. I have a list of about 20 parent directories which contain a few thousand recursive subdirectories, and every subdirectory holds a few hundred records.
The main computing part of the script looks like this:

sub getDirSize {
my $dirSize = 0;
my @dirContent = <*>;

my $sizeOfFilesInDir = 0;
foreach my $dirContent (@dirContent) {
   if (-f $dirContent) {
        my $size = (stat($dirContent))[7];
        $dirSize += $size;
   } elsif (-d $dirContent) {
        $dirSize += getDirSize($dirContent);
   } 
}
return $dirSize;
}

The script runs for more than an hour and I want to make it faster.

I also tried the shell du command, but its output (converted to bytes) is not accurate, and it is quite time-consuming as well.
I am working on HP-UX 11i v1.


7 Answers

站稳脚跟 2024-09-05 20:05:24


With some help from sfink and samtregar on perlmonks, try this one out:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
my $size = 0;
find( sub { $size += -f $_ ? -s _ : 0 }, shift(@ARGV) );
print $size, "\n";

Here we recurse through all subdirectories of the specified directory, get the size of each file, and re-use the stat data from the -f file test by using the special '_' filehandle for the size test.

I tend to believe that du would be reliable enough though.
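For reference, the snippet takes the starting directory as its first command-line argument (via shift(@ARGV)), so assuming it is saved as, say, dirsize.pl, it would be run as: perl dirsize.pl /path/to/dir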

○愚か者の日 2024-09-05 20:05:24


I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try.
Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
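A minimal sketch of that fork-and-collect approach; the directory groups and the /var/tmp result-file prefix below are placeholders, not anything from the original post:

#!/usr/bin/perl
# Sketch: fork one child per group of top-level directories, let each child
# sum file sizes with File::Find, and merge the per-group totals afterwards.
use strict;
use warnings;
use File::Find;

my @groups = (
    [ '/data/dir01', '/data/dir02' ],    # hypothetical paths: split your
    [ '/data/dir03', '/data/dir04' ],    # ~20 parent directories into groups
);

my $tmpbase = "/var/tmp/dirsize.$$";     # per-run prefix for the result files

for my $i (0 .. $#groups) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;                        # parent: keep launching children

    # Child: sum the sizes of all files under this group of directories.
    my $size = 0;
    find( sub { $size += -s _ if -f $_ }, @{ $groups[$i] } );

    open my $out, '>', "$tmpbase.$i" or die "cannot write $tmpbase.$i: $!";
    print {$out} "$size\n";
    close $out;
    exit 0;
}

wait() for 0 .. $#groups;                # wait for every child to finish

my $total = 0;
for my $i (0 .. $#groups) {
    open my $in, '<', "$tmpbase.$i" or die "cannot read $tmpbase.$i: $!";
    chomp( my $size = <$in> );
    close $in;
    unlink "$tmpbase.$i";
    $total += $size;
}
print "Total size: $total bytes\n";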

美男兮 2024-09-05 20:05:24


Whenever you want to speed something up, your first task is to find out what's slow. Use a profiler such as Devel::NYTProf to analyze the program and find out where you should concentrate your efforts.

In addition to reusing that data from the last stat, I'd get rid of the recursion since Perl is horrible at it. I'd construct a stack (or a queue) and work on that until there is nothing left to process.
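One possible shape for that iterative version, offered only as a sketch (it keeps the single-stat/'_' trick from the other answers and skips symlinks so a link cycle cannot make it walk forever):

#!/usr/bin/perl
# Iterative directory-size walk: an explicit stack instead of recursion.
use strict;
use warnings;

sub get_dir_size {
    my ($start) = @_;
    my $size  = 0;
    my @stack = ($start);

    while (@stack) {
        my $dir = pop @stack;
        opendir my $dh, $dir or next;   # skip unreadable directories
        for my $entry (readdir $dh) {
            next if $entry eq '.' or $entry eq '..';
            my $path = "$dir/$entry";
            next if -l $path;           # ignore symlinks (avoids cycles)
            stat $path;                 # one stat, reused below via '_'
            if    (-f _) { $size += -s _ }
            elsif (-d _) { push @stack, $path }
        }
        closedir $dh;
    }
    return $size;
}

print get_dir_size(shift @ARGV || '.'), " bytes\n";

Whether this actually beats the File::Find version is exactly the kind of question the profiler can settle.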

飘然心甜 2024-09-05 20:05:24


Below is another variant of getDirSize() which doesn't require a reference to a variable holding the current size and accepts a parameter to indicate whether sub-directories shall be considered or not:

#!/usr/bin/perl
use strict;
use warnings;

print 'Size (without sub-directories): ' . getDirSize(".") . " bytes\n";
print 'Size (incl. sub-directories): ' . getDirSize(".", 1) . " bytes\n";

sub getDirSize
# Returns the size in bytes of the files in a given directory and optionally its sub-directories
# Parameters:
#   $dirPath (string): the path to the directory to examine
#   $subDirs (optional boolean): FALSE (or missing) = consider only the files in $dirPath, TRUE = include sub-directories as well
# Returns:
#   $size (int): the size of the directory's contents
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath) or return 0;  # Skip directories we cannot read
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
      # This is a file
      $size += -s _;
    }
    elsif (-d _)
    {
      # This is a sub-directory: add the size of its contents
      $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    }
  }
  closedir($DH);

  return $size;
}
战皆罪 2024-09-05 20:05:24


I see a couple of problems. First, @dirContent is explicitly set to <*>, so it is re-read from the current working directory every time you enter getDirSize; since the call is recursive, the result is effectively an endless loop, at least until you exhaust the stack. Secondly, there is a special filehandle notation for retrieving information from a stat call: the underscore (_). See: http://perldoc.perl.org/functions/stat.html. Your code as-is calls stat three times for essentially the same information (-f, stat, and -d). Since file I/O is expensive, what you really want is to call stat once and then reference the data using "_". Here is some sample code that I believe accomplishes what you are trying to do:

#!/usr/bin/perl
use strict;
use warnings;

my $size = 0;
getDirSize(".", \$size);

print "Size: $size\n";

sub getDirSize {
  my $dir  = shift;
  my $size = shift;   # reference to the running total

  opendir(my $dh, $dir) or return;
  foreach my $dirContent (grep { !/^\.\.?$/ } readdir($dh)) {
     stat("$dir/$dirContent");          # stat once, then reuse via "_"
     if (-f _) {
       $$size += -s _;                  # dereference so the caller's total grows
     } elsif (-d _) {
       getDirSize("$dir/$dirContent", $size);
     }
  }
  closedir($dh);
}
大海や 2024-09-05 20:05:24


Bigs' answer is good. I modified it slightly as I wanted to get the sizes of all the folders under a given path on my Windows machine.

This is how I did it.

#!/usr/bin/perl
use strict;
use warnings;
use File::stat;


my $dirname = "C:\\Users\\xxx\\Documents\\initial-docs";
opendir (my $DIR, $dirname) || die "Error while opening dir $dirname: $!\n";

my $dirCount = 0;
foreach my $dirFileName(sort readdir $DIR)
{

      next if $dirFileName eq '.' or $dirFileName eq '..';

      my $dirFullPath = "$dirname\\$dirFileName";
      # only descend if it's a dir; skip files
      if (-d $dirFullPath )
      {
          $dirCount++;
          my $dirSize = getDirSize($dirFullPath, 1); #bytes
          my $dirSizeKB = $dirSize/1000;
          my $dirSizeMB = $dirSizeKB/1000;
          my $dirSizeGB = $dirSizeMB/1000;
          print("$dirCount - dir-name: $dirFileName  - Size: $dirSizeMB (MB) ... \n");

      }   
}

print "folders in $dirname: $dirCount ...\n";

sub getDirSize
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath);
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
     # This is a file
     $size += -s _;
    }
    elsif (-d _)
    {
     # This is a sub-directory: add the size of its contents
     $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    } 
  }
  closedir($DH);

  return $size;
}
1;

OUTPUT:

1 - dir-name: acct-requests  - Size: 0.458696 (MB) ...
2 - dir-name: environments  - Size: 0.771527 (MB) ...
3 - dir-name: logins  - Size: 0.317982 (MB) ...
folders in C:\Users\xxx\Documents\initial-docs: 3 ...
感悟人生的甜 2024-09-05 20:05:24


If your main directory tree is by far the largest consumer of directory and file inodes, then don't calculate it directly. Calculate the other, smaller half of the system and deduce the size of the rest from the total used disk space (which you can get from df in a couple of milliseconds). You might need to add a small 'fudge' factor to get to the same numbers. (Also remember that if you measure free space as root you'll see a little extra compared to other users, about 5% reserved on ext2/ext3 on Linux; I don't know about HP-UX.)
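A rough illustration of that estimate-from-df idea; the mount point, the list of "cheap" directories, and the column parsed out of df are placeholders, and on HP-UX you would likely reach for bdf instead, so treat this strictly as a sketch:

#!/usr/bin/perl
# Estimate the size of the big tree as: total used space on the filesystem
# minus the exactly-measured size of everything else on it.
use strict;
use warnings;
use File::Find;

my $mount_point = '/data';                           # hypothetical filesystem
my @other_dirs  = ('/data/small1', '/data/small2');  # the cheap-to-scan part

# Used kilobytes for the whole filesystem as reported by df -k
# (third column on many systems; adjust for your df/bdf output format).
my @df_lines = `df -k $mount_point`;
my ($used_kb) = (split ' ', $df_lines[-1])[2];

# Exact byte count for the small part.
my $other_bytes = 0;
find( sub { $other_bytes += -s _ if -f $_ }, @other_dirs );

# Everything else is, approximately, the big tree; a fudge factor is still
# needed because df counts allocated blocks, not file lengths.
my $estimate = $used_kb * 1024 - $other_bytes;
printf "Estimated size of the rest: %.1f MB\n", $estimate / (1024 * 1024);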
