What is the fastest way to get the size of a directory and its subdirectories on Unix using Perl?



I am using the Perl stat() function to get the size of a directory and its subdirectories. I have a list of about 20 parent directories which contain a few thousand recursive subdirectories, and every subdirectory holds a few hundred records.
The main computing part of the script looks like this:

sub getDirSize {
my $dirSize = 0;
my @dirContent = <*>;

my $sizeOfFilesInDir = 0;
foreach my $dirContent (@dirContent) {
   if (-f $dirContent) {
        my $size = (stat($dirContent))[7];
        $dirSize += $size;
   } elsif (-d $dirContent) {
        $dirSize += getDirSize($dirContent);
   } 
}
return $dirSize;
}

The script runs for more than an hour and I want to make it faster.

I also tried the shell du command, but its output (converted to bytes) is not accurate, and it is quite time-consuming as well.
I am working on HP-UX 11i v1.


7 Answers

站稳脚跟 2024-09-05 20:05:24


With some help from sfink and samtregar on perlmonks, try this one out:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
my $size = 0;
find( sub { $size += -f $_ ? -s _ : 0 }, shift(@ARGV) );
print $size, "\n";

Here we recurse through all subdirectories of the specified directory, get the size of each file, and re-use the stat data from the -f file test by using the special '_' filehandle for the size test.

I tend to believe that du would be reliable enough though.
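For reference, the snippet takes the starting directory as its first command-line argument (via shift(@ARGV)), so assuming it is saved as, say, dirsize.pl, it would be run as: perl dirsize.pl /path/to/dir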

○愚か者の日 2024-09-05 20:05:24


I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try.
Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
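A minimal sketch of that fork-and-collect approach; the directory groups and the /var/tmp result-file prefix below are placeholders, not anything from the original post:

#!/usr/bin/perl
# Sketch: fork one child per group of top-level directories, let each child
# sum file sizes with File::Find, and merge the per-group totals afterwards.
use strict;
use warnings;
use File::Find;

my @groups = (
    [ '/data/dir01', '/data/dir02' ],    # hypothetical paths: split your
    [ '/data/dir03', '/data/dir04' ],    # ~20 parent directories into groups
);

my $tmpbase = "/var/tmp/dirsize.$$";     # per-run prefix for the result files

for my $i (0 .. $#groups) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;                        # parent: keep launching children

    # Child: sum the sizes of all files under this group of directories.
    my $size = 0;
    find( sub { $size += -s _ if -f $_ }, @{ $groups[$i] } );

    open my $out, '>', "$tmpbase.$i" or die "cannot write $tmpbase.$i: $!";
    print {$out} "$size\n";
    close $out;
    exit 0;
}

wait() for 0 .. $#groups;                # wait for every child to finish

my $total = 0;
for my $i (0 .. $#groups) {
    open my $in, '<', "$tmpbase.$i" or die "cannot read $tmpbase.$i: $!";
    chomp( my $size = <$in> );
    close $in;
    unlink "$tmpbase.$i";
    $total += $size;
}
print "Total size: $total bytes\n";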

美男兮 2024-09-05 20:05:24


Whenever you want to speed something up, your first task is to find out what's slow. Use a profiler such as Devel::NYTProf to analyze the program and find out where you should concentrate your efforts.

In addition to reusing that data from the last stat, I'd get rid of the recursion since Perl is horrible at it. I'd construct a stack (or a queue) and work on that until there is nothing left to process.
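One possible shape for that iterative version, offered only as a sketch (it keeps the single-stat/'_' trick from the other answers and skips symlinks so a link cycle cannot make it walk forever):

#!/usr/bin/perl
# Iterative directory-size walk: an explicit stack instead of recursion.
use strict;
use warnings;

sub get_dir_size {
    my ($start) = @_;
    my $size  = 0;
    my @stack = ($start);

    while (@stack) {
        my $dir = pop @stack;
        opendir my $dh, $dir or next;   # skip unreadable directories
        for my $entry (readdir $dh) {
            next if $entry eq '.' or $entry eq '..';
            my $path = "$dir/$entry";
            next if -l $path;           # ignore symlinks (avoids cycles)
            stat $path;                 # one stat, reused below via '_'
            if    (-f _) { $size += -s _ }
            elsif (-d _) { push @stack, $path }
        }
        closedir $dh;
    }
    return $size;
}

print get_dir_size(shift @ARGV || '.'), " bytes\n";

Whether this actually beats the File::Find version is exactly the kind of question the profiler can settle.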

飘然心甜 2024-09-05 20:05:24


Below is another variant of getDirSize() which doesn't require a reference to a variable holding the current size and accepts a parameter to indicate whether sub-directories shall be considered or not:

#!/usr/bin/perl
use strict;
use warnings;

print 'Size (without sub-directories): ' . getDirSize(".") . " bytes\n";
print 'Size (incl. sub-directories): ' . getDirSize(".", 1) . " bytes\n";

sub getDirSize
# Returns the size in bytes of the files in a given directory and optionally its sub-directories
# Parameters:
#   $dirPath (string): the path to the directory to examine
#   $subDirs (optional boolean): FALSE (or missing) = consider only the files in $dirPath, TRUE = include sub-directories as well
# Returns:
#   $size (int): the size of the directory's contents
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath) or return 0;  # Skip directories we cannot read
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
      # This is a file
      $size += -s _;
    }
    elsif (-d _)
    {
      # This is a sub-directory: add the size of its contents
      $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    }
  }
  closedir($DH);

  return $size;
}
战皆罪 2024-09-05 20:05:24


I see a couple of problems. First, @dirContent is explicitly set to <*>, so it is re-read from the current working directory every time you enter getDirSize; since the call is recursive, the result is effectively an endless loop, at least until you exhaust the stack. Secondly, there is a special filehandle notation for retrieving information from a stat call: the underscore (_). See: http://perldoc.perl.org/functions/stat.html. Your code as-is calls stat three times for essentially the same information (-f, stat, and -d). Since file I/O is expensive, what you really want is to call stat once and then reference the data using "_". Here is some sample code that I believe accomplishes what you are trying to do:

#!/usr/bin/perl
use strict;
use warnings;

my $size = 0;
getDirSize(".", \$size);

print "Size: $size\n";

sub getDirSize {
  my $dir  = shift;
  my $size = shift;   # reference to the running total

  opendir(my $dh, $dir) or return;
  foreach my $dirContent (grep { !/^\.\.?$/ } readdir($dh)) {
     stat("$dir/$dirContent");          # stat once, then reuse via "_"
     if (-f _) {
       $$size += -s _;                  # dereference so the caller's total grows
     } elsif (-d _) {
       getDirSize("$dir/$dirContent", $size);
     }
  }
  closedir($dh);
}
大海や 2024-09-05 20:05:24


Bigs' answer is good. I modified it slightly as I wanted to get the sizes of all the folders under a given path on my Windows machine.

This is how I did it.

#!/usr/bin/perl
use strict;
use warnings;
use File::stat;


my $dirname = "C:\\Users\\xxx\\Documents\\initial-docs";
opendir (my $DIR, $dirname) || die "Error while opening dir $dirname: $!\n";

my $dirCount = 0;
foreach my $dirFileName(sort readdir $DIR)
{

      next if $dirFileName eq '.' or $dirFileName eq '..';

      my $dirFullPath = "$dirname\\$dirFileName";
      # only descend if it's a dir; skip files
      if (-d $dirFullPath )
      {
          $dirCount++;
          my $dirSize = getDirSize($dirFullPath, 1); #bytes
          my $dirSizeKB = $dirSize/1000;
          my $dirSizeMB = $dirSizeKB/1000;
          my $dirSizeGB = $dirSizeMB/1000;
          print("$dirCount - dir-name: $dirFileName  - Size: $dirSizeMB (MB) ... \n");

      }   
}

print "folders in $dirname: $dirCount ...\n";

sub getDirSize
{
  my ($dirPath, $subDirs) = @_;  # Get the parameters

  my $size = 0;

  opendir(my $DH, $dirPath);
  foreach my $dirEntry (readdir($DH))
  {
    stat("${dirPath}/${dirEntry}");  # Stat once and then refer to "_"
    if (-f _)
    {
     # This is a file
     $size += -s _;
    }
    elsif (-d _)
    {
     # This is a sub-directory: add the size of its contents
     $size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
    } 
  }
  closedir($DH);

  return $size;
}
1;

OUTPUT:

1 - dir-name: acct-requests  - Size: 0.458696 (MB) ...
2 - dir-name: environments  - Size: 0.771527 (MB) ...
3 - dir-name: logins  - Size: 0.317982 (MB) ...
folders in C:\Users\xxx\Documents\initial-docs: 3 ...
感悟人生的甜 2024-09-05 20:05:24


If your main directory tree is by far the largest consumer of directory and file inodes, then don't calculate it directly. Calculate the other, smaller half of the system and deduce the size of the rest from the total used disk space (which you can get from df in a couple of milliseconds). You might need to add a small 'fudge' factor to get to the same numbers. (Also remember that if you measure free space as root you'll see a little extra compared to other users, about 5% reserved on ext2/ext3 on Linux; I don't know about HP-UX.)
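A rough illustration of that estimate-from-df idea; the mount point, the list of "cheap" directories, and the column parsed out of df are placeholders, and on HP-UX you would likely reach for bdf instead, so treat this strictly as a sketch:

#!/usr/bin/perl
# Estimate the size of the big tree as: total used space on the filesystem
# minus the exactly-measured size of everything else on it.
use strict;
use warnings;
use File::Find;

my $mount_point = '/data';                           # hypothetical filesystem
my @other_dirs  = ('/data/small1', '/data/small2');  # the cheap-to-scan part

# Used kilobytes for the whole filesystem as reported by df -k
# (third column on many systems; adjust for your df/bdf output format).
my @df_lines = `df -k $mount_point`;
my ($used_kb) = (split ' ', $df_lines[-1])[2];

# Exact byte count for the small part.
my $other_bytes = 0;
find( sub { $other_bytes += -s _ if -f $_ }, @other_dirs );

# Everything else is, approximately, the big tree; a fudge factor is still
# needed because df counts allocated blocks, not file lengths.
my $estimate = $used_kb * 1024 - $other_bytes;
printf "Estimated size of the rest: %.1f MB\n", $estimate / (1024 * 1024);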
