How to find files that exist in different directories under a given path in Perl

Published 2024-12-21 02:45:54


I'm looking for a way to find files that reside in several directories under a given path. In other words, those directories contain files with the same filename. My script seems to have a hierarchy problem when walking down to the correct path to grep the filenames for processing. I have a fixed path as input, and the script needs to look into that path and find the files from there, but my script seems to stop two tiers up and process from there, rather than descending to the last directories in the tree (in my case it processes "ln" and "nn" and starts running the subroutines there).

The fixed input path is:

/nfs/disks/version_2.0/

The files that I want to post-process with a subroutine exist under several directories, as shown below. Basically, I want to check whether file1.abc exists in all of the directories temp1, temp2 and temp3 under the ln directory, and likewise whether file2.abc exists in temp1, temp2 and temp3 under the nn directory.

The full paths of the files I want to check look like this:

/nfs/disks/version_2.0/dir_a/ln/temp1/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp2/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp3/file1.abc

/nfs/disks/version_2.0/dir_a/nn/temp1/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp2/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp3/file2.abc

My script is as follows:

#! /usr/bin/perl -w 
my $dir = '/nfs/fm/disks/version_2.0/' ;
opendir(TEMP, $dir) || die $! ;
foreach my $file (readdir(TEMP)) {
    next if ($file eq "." || $file eq "..") ;
    if (-d "$dir/$file") {
        my $d = "$dir/$file";   
        print "Directory:- $d\n" ;
        &getFile($d);
        &compare($file) ;
    }
}

Note that I put the print "Directory:- $d\n"; there for debugging purposes, and it printed this:

/nfs/disks/version_2.0/dir_a/
/nfs/disks/version_2.0/dir_b/

So I knew it was descending into the wrong path before running the subroutines that follow.

Can somebody point out where the error in my script is? Thanks!


秋意浓 2024-12-28 02:45:54

To be clear: the script is supposed to recurse through a directory and look for files with a particular filename? In this case, I think the following code is the problem:

if (-d "$dir/$file") {
    my $d = "$dir/$file";   
    print "Directory:- $d\n" ;
    &getFile($d);
    &compare($file) ;
}

I'm assuming the &getFile($d) is meant to step into a directory (i.e., the recursive step). This is fine. However, it looks like the &compare($file) is the action that you want to take when the object that you're looking at isn't a directory. Therefore, that code block should look something like this:

if (-d "$dir/$file") {
    &getFile("$dir/$file");  # the recursive step, for directories inside of this one
} elsif( -f "$dir/$file" ){
    &compare("$dir/$file");  # the action on files inside of the current directory
}

The general pseudo-code should look like this:

sub myFind {
    my $dir = shift;
    opendir( my $dh, $dir ) or die "Can't open $dir: $!";
    foreach my $file ( readdir $dh ){
        next if $file eq "." || $file eq "..";
        my $obj = "$dir/$file";
        if( -d $obj ){
            myFind( $obj );
        } elsif( -f $obj ){
            doSomethingWithFile( $obj );
        }
    }
    closedir $dh;
}
myFind( "/nfs/fm/disks/version_2.0" );

As a side note: this script is reinventing the wheel. You only need to write a script that does the processing on an individual file. You could do the rest entirely from the shell:

find /nfs/fm/disks/version_2.0 -type f -name "the-filename-you-want" -exec your_script.pl {} \;
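Along the same lines, the duplicate check itself can be sketched from the shell. This is a hypothetical sketch: /tmp/demo and its layout are made up here to mirror the question's tree.

```shell
# Build a tiny tree mimicking the question's layout, then list
# basenames that occur in more than one directory under it.
mkdir -p /tmp/demo/ln/temp1 /tmp/demo/ln/temp2 /tmp/demo/nn/temp1
touch /tmp/demo/ln/temp1/file1.abc /tmp/demo/ln/temp2/file1.abc
touch /tmp/demo/nn/temp1/file2.abc
find /tmp/demo -type f | awk -F/ '{print $NF}' | sort | uniq -d
```

Here file1.abc is printed because it appears under both temp1 and temp2, while file2.abc appears only once and is filtered out by uniq -d.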
小嗷兮 2024-12-28 02:45:54

Wow, it's like reliving the 1990s! Perl code has evolved somewhat, and you really need to learn the new stuff. It looks like you learned Perl in version 3.0 or 4.0. Here's some pointers:

  • Use use warnings; instead of -w on the command line.
  • Use use strict;. This will require you to predeclare variables using my which will scope them to the local block or the file if they're not in a local block. This helps catch a lot of errors.
  • Don't put & in front of subroutine names.
  • Use and, or, and not instead of &&, ||, and !.
  • Learn about Perl Modules which can save you a lot of time and effort.

When someone says detect duplicates, I immediately think of hashes. If you use a hash based upon your file's name, you can easily see if there are duplicate files.

Of course a hash can only have a single value for each key. Fortunately, in Perl 5.x, that value can be a reference to another data structure.

So, I recommend you use a hash that contains a reference to a list (array in old parlance). You can push each instance of the file to that list.

Using your example, you'd have a data structure that looks like this:

%file_hash = (
    'file1.abc' => [
       '/nfs/disks/version_2.0/dir_a/ln/temp1',
       '/nfs/disks/version_2.0/dir_a/ln/temp2',
       '/nfs/disks/version_2.0/dir_a/ln/temp3',
    ],
    'file2.abc' => [
       '/nfs/disks/version_2.0/dir_a/nn/temp1',
       '/nfs/disks/version_2.0/dir_a/nn/temp2',
       '/nfs/disks/version_2.0/dir_a/nn/temp3',
    ],
);

And, here's a program to do it:

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);        # `say` is like `print` with a trailing newline

use File::Basename; #imports `dirname` and `basename` commands
use File::Find;             #Implements Unix `find` command.

use constant DIR => "/nfs/disks/version_2.0";

# Find all duplicates
my %file_hash;
find (\&wanted, DIR);

# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}

sub wanted {
    return if not -f $_;    

    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}

Here are a few things about File::Find:

  • The work takes place in the subroutine wanted.
  • The $_ is the name of the file, and I can use it to see whether this is a file or a directory.
  • $File::Find::name is the full name of the file, including the path.
  • $File::Find::dir is the name of the directory.

If the array reference doesn't exist, I create it with $file_hash{$_} = [];. This isn't necessary, but I find it comforting, and it can prevent errors. To use $file_hash{$_} as an array, I have to dereference it, which I do by wrapping it in @{ }, as in @{$file_hash{$_}}.

Once all the files are found, I can print out the entire structure. The only thing I do is check to make sure there is more than one member in each array. If there's only a single member, then there are no duplicates.


Response to Grace

Hi David W., thank you very much for your explanation and sample script. Sorry, maybe I wasn't really clear in defining my problem statement. I don't think I can use a hash keyed on the path for the data structure. There are a few hundred file*.abc files, the exact number undetermined, and the file*.abc files may share the same filename while actually differing in content across the directory structures.

For example, the file1.abc that resides under "/nfs/disks/version_2.0/dir_a/ln/temp1" does not have the same content as the file1.abc under "/nfs/disks/version_2.0/dir_a/ln/temp2" and "/nfs/disks/version_2.0/dir_a/ln/temp3". My intention is to grep the list of file*.abc files in each of the directory structures (temp1, temp2 and temp3) and compare the filename list with a master list. Could you help shed some light on how to solve this? Thanks. – Grace yesterday

I'm just printing the file in my sample code, but instead of printing the file, you could open them and process them. After all, you now have the file name and the directory. Here's the heart of my program again. This time, I'm opening the file and looking at the content:

foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        #say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            #say "    $dir_name";
            open (my $fh, "<", "$dir_name/$file_name")
              or die qq(Can't open file "$dir_name/$file_name" for reading);
            # Process your file here...
            close $fh;
        }
    }
}

If you are only looking for certain files, you could modify the wanted function to skip over files you don't want. For example, here I am only looking for files which match the file*.txt pattern. Note I use a regular expression of /^file.*\.txt$/ to match the name of the file. As you can see, it's the same as the previous wanted subroutine. The only difference is my test: I'm looking for something that is a file (-f) and has the correct name (file*.txt):

sub wanted {
    return unless -f $_ and /^file.*\.txt$/;    

    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}

If you are looking at the file contents, you can use an MD5 hash to determine whether the contents match. This reduces a file to a short digest (32 hex characters for MD5), which could even be used as the hash key instead of the file name. That way, files with matching MD5 hashes (and thus matching contents) would end up in the same hash list.

You talk about a "master list" of files and it seems you have the idea that this master list needs to match the content of the file you're looking for. So, I'm making a slight mod in my program. I am first taking that master list you talked about, and generating MD5 sums for each file. Then I'll look at all the files in that directory, but only take the ones with the matching MD5 hash...

By the way, this has not been tested.

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);        # `say` is like `print` with a trailing newline

use File::Find;             #Implements Unix `find` command.
use Digest::file qw(digest_file_hex);

use constant DIR         => "/nfs/disks/version_2.0";
use constant MASTER_LIST_DIR => "/some/directory";

# First, I'm going through the MASTER_LIST_DIR directory
# and finding all of the master list files. I'm going to take
# the MD5 hash of those files, and store them in a Perl hash 
# that's keyed by the name of the file. Thus, when I find a 
# file with a matching name, I can compare the MD5 of that file
# and the master file. If they match, the files are the same. If
# not, they're different.

# In this example, I'm inlining the function I use to find the files
# instead of making it a separate function.

my %master_hash;
find (
    sub {
        $master_hash{$_} = digest_file_hex($_, "MD5") if -f;
    },
    MASTER_LIST_DIR
);

# Now I have the MD5 of all the master files, I'm going to search my
# DIR directory for the files that have the same MD5 hash as the
# master list files did. If they do have the same MD5 hash, I'll
# print out their names as before.

my %file_hash;
find (\&wanted, DIR);

# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}

# The wanted function has been modified since the last example.
# Here, I'm only going to put files in the %file_hash if they
# have an MD5 hash that matches one of the master list files.

sub wanted {
    if ( -f $_ and exists $master_hash{$_}
         and $master_hash{$_} eq digest_file_hex($_, "MD5") ) {
        $file_hash{$_} //= [];    # Using TLP's syntax hint
        push @{$file_hash{$_}}, $File::Find::dir;
    }
}