Using Perl to clean up a filesystem with one or more duplicates
I have two disks: one is an ad-hoc backup disk, which is a mess with duplicates everywhere, and the other is the disk in my laptop, which is an equal mess. I need to back up the unique files and delete the duplicates. So, I need to do the following:
- Find all non-zero size files
- Calculate the MD5 digest of all files
- Find files with duplicate file names
- Separate unique files from master copies and other copies.
With the output of this script I will:
- Back up the unique and master files
- Delete the other copies
Unique file = no other copies
Master copy = the first instance where other copies exist, possibly matching a preferential path
Other copies = not master copies
I've created the appended script, which seems to make sense to me, but:
total files != unique files + master copies + other copies
I have two questions:
- Where's the error in my logic?
- Is there a more efficient way of doing this?
I chose on-disk hashes so that I don't run out of memory when processing enormous file lists.
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;

my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';

my $find = "$base/find.txt";
my $files = "$base/files.txt";

my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";

open (FIND, "< $find");
open (FILES, "> $files");

print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
    chomp($path);
    next if ($path =~ /^\s*$/);
    if (-f $path && -s $path) {
        print FILES "$path\n";
        $total_files++;
        printf "\r$total_files";
    }
}
close(FIND);
close(FILES);

open (FILES, "< $files");

sub compare {
    my ($key1, $key2) = @_;
    $key1 cmp $key2;
}

$DB_BTREE->{'compare'} = \&compare;

my %duplicate_count = ();
tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_duplicate_count_file: $!\n";

my %unique = ();
tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_unique_file: $!\n";

my %master_copy = ();
tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_master_copy_file: $!\n";

my %other_copy = ();
tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open $db_other_copy_file: $!\n";

print "\nFinding duplicate filenames and calculating their MD5 digests\n";

my $file_counter = 0;
my $percent_complete = 0;

while (my $path = <FILES>) {
    $file_counter++;

    # remove trailing whitespace
    chomp($path);

    # extract filename from path
    my ($vol,$dir,$filename) = File::Spec->splitpath($path);

    # calculate the file's MD5 digest
    open(FILE, $path) or die "Can't open $path: $!";
    binmode(FILE);
    my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close(FILE);

    # filename not stored as duplicate
    if (!exists($duplicate_count{$filename})) {
        # assume unique
        $unique{$md5digest} = $path;
        # which implies 0 duplicates
        $duplicate_count{$filename} = 0;
    }
    # filename already found
    else {
        # delete unique record
        delete($unique{$md5digest});
        # second duplicate
        if ($duplicate_count{$filename}) {
            $duplicate_count{$filename}++;
        }
        # first duplicate
        else {
            $duplicate_count{$filename} = 1;
        }
        # the master copy is already assigned
        if (exists($master_copy{$md5digest})) {
            # the current path matches $path_pref, so becomes our new master copy
            if ($path =~ qq|^$path_pref|) {
                $master_copy{$md5digest} = $path;
            }
            else {
                # this one is a secondary copy
                $other_copy{$path} = $md5digest;
                # store with path as key, as there are duplicate digests
            }
        }
        # assume this is the master copy
        else {
            $master_copy{$md5digest} = $path;
        }
    }
    $percent_complete = int(($file_counter/$total_files)*100);
    printf("\rProgress: $percent_complete %%");
}
close(FILES);

# Write out data to text files for debugging
open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");

print "\n\nUnique files: ",scalar keys %unique,"\n";
foreach my $key (keys %unique) {
    print UNIQUE "$key\t", $unique{$key}, "\n";
    print UNIQUE_MD5 "$key\n";
}
close UNIQUE;
close UNIQUE_MD5;

open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");

print "Master copies: ",scalar keys %master_copy,"\n";
foreach my $key (keys %master_copy) {
    print MASTER "$key\t", $master_copy{$key}, "\n";
    print MASTER_MD5 "$key\n";
}
close MASTER;
close MASTER_MD5;

open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");

print "Other copies: ",scalar keys %other_copy,"\n";
foreach my $key (keys %other_copy) {
    print OTHER $other_copy{$key}, "\t$key\n";
    print OTHER_MD5 "$other_copy{$key}\n";
}
close OTHER;
close OTHER_MD5;
print "\n";

untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;

print "\n";
Comments (4)
Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":
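Presumably the snippet quoted here was the assignment from the first branch of the question's loop:

    $unique{$md5digest} = $path;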
The next time, you delete that unique record, without storing the path:
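Presumably followed by the delete from the else branch:

    delete($unique{$md5digest});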
So whatever filepath was at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.
You'll need something like:
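The answer's original snippet is missing here; a minimal sketch of the idea (my reconstruction, reusing the question's hashes), keeping the path that %unique held for this digest instead of dropping it:

    # reconstruction, not the answer's original code: promote the previously
    # "unique" path to master before the record is discarded
    if (exists $unique{$md5digest}) {
        $master_copy{$md5digest} = $unique{$md5digest};
        delete $unique{$md5digest};
    }

With that in place, the current $path then falls into the exists($master_copy{$md5digest}) branch and is recorded in %other_copy (or becomes the master if it matches $path_pref).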
Also, as I mentioned in a comment above, IO::File would really clean up this code.
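As an illustration of that suggestion (my sketch, not code from the answer), the digest step could use a lexical IO::File handle:

    use IO::File;

    my $fh = IO::File->new($path, 'r') or die "Can't open $path: $!";
    binmode($fh);
    my $md5digest = Digest::MD5->new->addfile($fh)->hexdigest;
    $fh->close;    # also closed automatically when $fh goes out of scope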
This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments). If you don't want to explicitly ask each time, you could also check out the autodie module.
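A hedged sketch of that style, applied to the script's first two opens (variable names taken from the question):

    open my $find_fh,  '<', $find  or die "Cannot open $find: $!";
    open my $files_fh, '>', $files or die "Cannot open $files: $!";

And with autodie, a failed open or close throws an exception without an explicit check:

    use autodie;   # open, close, etc. now die on failure automatically

    open my $find_fh, '<', $find;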
One apparent optimization is to use file size as an initial comparison basis, and only compute an MD5 for files below a certain size or when two files collide with the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
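A hedged sketch of that pre-filter (my illustration; the command-line argument list standing in for the candidate files is an assumption), grouping paths by size and only hashing groups where two or more files share a size:

    use strict;
    use warnings;
    use Digest::MD5;

    my @paths = @ARGV;    # assumption: candidate files are passed on the command line

    # group candidate paths by file size; only size collisions need hashing
    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @paths;

    for my $size (keys %by_size) {
        my @group = @{ $by_size{$size} };
        next if @group == 1;    # a unique size cannot have a content duplicate
        for my $path (@group) {
            open my $fh, '<', $path or die "Can't open $path: $!";
            binmode($fh);
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
            # ... feed ($md5, $path) into the existing duplicate bookkeeping ...
        }
    }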
You also might want to consider changing your approach for certain kinds of files that contain embedded metadata that might change without changing the underlying data, so you can find additional dupes even if the MD5s don't match. I'm speaking, of course, of MP3s or other music files that have metadata tags which might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
See here for related data on solutions in the abstract.

IMPORTANT note: as much as we'd like to believe that 2 files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, once you've broken it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.

Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations.

If your hash function told you that 2 files both returned a "1", you would not assume they are the same file.

Given a hash of 2 bits, there are only 4 possible combinations; 2 files returning the same value you would not assume to be the same file.

Given a hash of 3 bits, there are only 8 possible combinations; 2 files returning the same value you would not assume to be the same file.

This pattern goes on in ever-increasing amounts, to the point that people, for some bizarre reason, start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.

There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical, you need to read every bit.
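A hedged sketch of such a byte-for-byte confirmation (my example, not from the answer), reading both candidates in chunks and stopping at the first difference:

    use strict;
    use warnings;

    # returns 1 if the two files have identical contents, 0 otherwise
    sub files_identical {
        my ($path_a, $path_b) = @_;
        return 0 if -s $path_a != -s $path_b;    # different sizes can never match
        open my $fh_a, '<', $path_a or die "Can't open $path_a: $!";
        open my $fh_b, '<', $path_b or die "Can't open $path_b: $!";
        binmode($_) for $fh_a, $fh_b;
        my ($buf_a, $buf_b);
        while (read($fh_a, $buf_a, 65536)) {
            read($fh_b, $buf_b, 65536);
            return 0 if $buf_a ne $buf_b;        # stop at the first differing chunk
        }
        return 1;
    }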