Using Perl to clean up a filesystem with one or more duplicates
I have two disks: one is an ad-hoc backup disk, which is a mess with duplicates everywhere, and the other is the disk in my laptop, which is an equal mess. I need to back up the unique files and delete the duplicates. So, I need to do the following:
- Find all non-zero size files
- Calculate the MD5 digest of all files
- Find files with duplicate file names
- Separate unique files from master copies and other copies.
With the output of this script I will:
- Back up the unique and master files
- Delete the other copies
Unique file = no other copies
Master copy = the first instance where other copies exist, possibly matching a preferential path
Other copies = not master copies
I've created the appended script, which seems to make sense to me, but:
total files != unique files + master copies + other copies
I have two questions:
- Where's the error in my logic?
- Is there a more efficient way of doing this?
I chose on-disk hashes so that I don't run out of memory when processing enormous file lists.
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;

my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';

my $find = "$base/find.txt";
my $files = "$base/files.txt";

my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";

open (FIND, "< $find");
open (FILES, "> $files");

print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
    chomp($path);
    next if ($path =~ /^\s*$/);
    if (-f $path && -s $path) {
        print FILES "$path\n";
        $total_files++;
        printf "\r$total_files";
    }
}
close(FIND);
close(FILES);

open (FILES, "< $files");

sub compare {
    my ($key1, $key2) = @_;
    $key1 cmp $key2;
}

$DB_BTREE->{'compare'} = \&compare;

my %duplicate_count = ();
tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_duplicate_count_file: $!\n";

my %unique = ();
tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_unique_file: $!\n";

my %master_copy = ();
tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_master_copy_file: $!\n";

my %other_copy = ();
tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_other_copy_file: $!\n";

print "\nFinding duplicate filenames and calculating their MD5 digests\n";

my $file_counter = 0;
my $percent_complete = 0;

while (my $path = <FILES>) {
    $file_counter++;

    # remove trailing whitespace
    chomp($path);

    # extract filename from path
    my ($vol,$dir,$filename) = File::Spec->splitpath($path);

    # calculate the file's MD5 digest
    open(FILE, $path) or die "Can't open $path: $!";
    binmode(FILE);
    my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close(FILE);

    # filename not stored as duplicate
    if (!exists($duplicate_count{$filename})) {
        # assume unique
        $unique{$md5digest} = $path;
        # which implies 0 duplicates
        $duplicate_count{$filename} = 0;
    }
    # filename already found
    else {
        # delete unique record
        delete($unique{$md5digest});
        # second duplicate
        if ($duplicate_count{$filename}) {
            $duplicate_count{$filename}++;
        }
        # first duplicate
        else {
            $duplicate_count{$filename} = 1;
        }
        # the master copy is already assigned
        if (exists($master_copy{$md5digest})) {
            # the current path matches $path_pref, so becomes our new master copy
            if ($path =~ qq|^$path_pref|) {
                $master_copy{$md5digest} = $path;
            }
            else {
                # this one is a secondary copy
                $other_copy{$path} = $md5digest;
                # store with path as key, as there are duplicate digests
            }
        }
        # assume this is the master copy
        else {
            $master_copy{$md5digest} = $path;
        }
    }
    $percent_complete = int(($file_counter/$total_files)*100);
    printf("\rProgress: $percent_complete %%");
}
close(FILES);

# Write out data to text files for debugging
open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");

print "\n\nUnique files: ",scalar keys %unique,"\n";
foreach my $key (keys %unique) {
    print UNIQUE "$key\t", $unique{$key}, "\n";
    print UNIQUE_MD5 "$key\n";
}
close UNIQUE;
close UNIQUE_MD5;

open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");

print "Master copies: ",scalar keys %master_copy,"\n";
foreach my $key (keys %master_copy) {
    print MASTER "$key\t", $master_copy{$key}, "\n";
    print MASTER_MD5 "$key\n";
}
close MASTER;
close MASTER_MD5;

open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");

print "Other copies: ",scalar keys %other_copy,"\n";
foreach my $key (keys %other_copy) {
    print OTHER $other_copy{$key}, "\t$key\n";
    print OTHER_MD5 "$other_copy{$key}\n";
}
close OTHER;
close OTHER_MD5;
print "\n";

untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;

print "\n";
Comments (4)
Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":
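Presumably the snippet quoted here was the assignment from the first branch of the question's loop:

    $unique{$md5digest} = $path;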
The next time, you delete that unique record, without storing the path:
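Presumably followed by the delete from the else branch:

    delete($unique{$md5digest});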
So whatever filepath was at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.
You'll need something like:
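The answer's original snippet is missing here; a minimal sketch of the idea (my reconstruction, reusing the question's hashes), keeping the path that %unique held for this digest instead of dropping it:

    # reconstruction, not the answer's original code: promote the previously
    # "unique" path to master before the record is discarded
    if (exists $unique{$md5digest}) {
        $master_copy{$md5digest} = $unique{$md5digest};
        delete $unique{$md5digest};
    }

With that in place, the current $path then falls into the exists($master_copy{$md5digest}) branch and is recorded in %other_copy (or becomes the master if it matches $path_pref).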
Also, as I mentioned in a comment above, IO::File would really clean up this code.
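As an illustration of that suggestion (my sketch, not code from the answer), the digest step could use a lexical IO::File handle:

    use IO::File;

    my $fh = IO::File->new($path, 'r') or die "Can't open $path: $!";
    binmode($fh);
    my $md5digest = Digest::MD5->new->addfile($fh)->hexdigest;
    $fh->close;    # also closed automatically when $fh goes out of scope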
This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments). If you don't want to explicitly ask each time, you could also check out the autodie module.
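A hedged sketch of that style, applied to the script's first two opens (variable names taken from the question):

    open my $find_fh,  '<', $find  or die "Cannot open $find: $!";
    open my $files_fh, '>', $files or die "Cannot open $files: $!";

And with autodie, a failed open or close throws an exception without an explicit check:

    use autodie;   # open, close, etc. now die on failure automatically

    open my $find_fh, '<', $find;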
One apparent optimization is to use file size as an initial comparison basis, and only compute an MD5 for files below a certain size or when two files collide with the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
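A hedged sketch of that pre-filter (my illustration; the command-line argument list standing in for the candidate files is an assumption), grouping paths by size and only hashing groups where two or more files share a size:

    use strict;
    use warnings;
    use Digest::MD5;

    my @paths = @ARGV;    # assumption: candidate files are passed on the command line

    # group candidate paths by file size; only size collisions need hashing
    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @paths;

    for my $size (keys %by_size) {
        my @group = @{ $by_size{$size} };
        next if @group == 1;    # a unique size cannot have a content duplicate
        for my $path (@group) {
            open my $fh, '<', $path or die "Can't open $path: $!";
            binmode($fh);
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
            # ... feed ($md5, $path) into the existing duplicate bookkeeping ...
        }
    }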
You also might want to consider changing your approach for certain kinds of files that contain embedded metadata that might change without changing the underlying data, so you can find additional dupes even if the MD5s don't match. I'm speaking, of course, of MP3s or other music files that have metadata tags which might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
See here for related data on solutions in the abstract.

IMPORTANT note: as much as we'd like to believe that 2 files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, once you've broken it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.

Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations.

If your hash function told you that 2 files both returned a "1", you would not assume they are the same file.

Given a hash of 2 bits, there are only 4 possible combinations; 2 files returning the same value you would not assume to be the same file.

Given a hash of 3 bits, there are only 8 possible combinations; 2 files returning the same value you would not assume to be the same file.

This pattern goes on in ever-increasing amounts, to the point that people, for some bizarre reason, start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.

There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical, you need to read every bit.
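A hedged sketch of such a byte-for-byte confirmation (my example, not from the answer), reading both candidates in chunks and stopping at the first difference:

    use strict;
    use warnings;

    # returns 1 if the two files have identical contents, 0 otherwise
    sub files_identical {
        my ($path_a, $path_b) = @_;
        return 0 if -s $path_a != -s $path_b;    # different sizes can never match
        open my $fh_a, '<', $path_a or die "Can't open $path_a: $!";
        open my $fh_b, '<', $path_b or die "Can't open $path_b: $!";
        binmode($_) for $fh_a, $fh_b;
        my ($buf_a, $buf_b);
        while (read($fh_a, $buf_a, 65536)) {
            read($fh_b, $buf_b, 65536);
            return 0 if $buf_a ne $buf_b;        # stop at the first differing chunk
        }
        return 1;
    }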