How can I match similar filenames and rename them so that diff tools like Beyond Compare treat them as a pair for a binary comparison?

Posted 2025-01-01 01:39:05


I'm looking for the best approach to comparing files that I believe are identical but which have different filenames. Comparison tools like BeyondCompare are great but they don't yet handle different filenames - when comparing files in separate folders they attempt comparisons with the files that have the same name on either side.

(I don't work for or have a financial interest in BeyondCompare, but I use the tool a lot and find it has some great features).

There is MindGems Fast Duplicate File Finder, which matches files with different names anywhere in several folder trees, but I believe it is based on CRC checks. I am using this tool and gradually coming to trust it. It has shown no faults so far, but I don't yet trust it as much as BeyondCompare, which offers the complete peace of mind of a full binary compare on the files.

In my case the files tend to have similar names. The differences are the ordering of the words, punctuation, case, and some words being absent. Some diff tools, Beyond Compare included, already provide a regex filter for matching files, but it is not easy to use here because the filename substrings can be out of order.

I'm looking for a way to match similar filenames before renaming the files to be the same and then 'feeding' them to a tool like BeyondCompare. Solutions could be scripts or perhaps in the form of an application.

At the moment I have an idea for an algorithm (to implement in Perl) to match the filenames to suit my problem whereby the filenames are similar as described above.

Can you suggest something better or a completely different approach?

1. Find a list of files with exactly the same file size.
2. Make a hash of the alphanumeric substrings of the first filename, using non-alphanumeric characters or spaces as delimiters.
3. Make a hash of the alphanumeric substrings of the second filename, using non-alphanumeric characters or spaces as delimiters.
4. Match occurrences.
5. Find which filename has the highest number of substrings.
6. Calculate a percentage score for the pair: the number of matches divided by the highest number of substrings.
7. Repeat the comparison for each file with every other file of exactly the same size.
8. Sort the pair comparisons by percentage score to get suggestions of files to compare.
9. Rename one file in the pair so that its name is the same as the other's. Place the files in separate folders.
10. Run a comparison tool like BeyondCompare on the files in folder comparison mode.
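
The scoring steps above (size grouping, tokenizing, and the match percentage) could be sketched in Perl roughly as follows. This is a minimal sketch under stated assumptions, not the asker's implementation: the sub names `name_tokens`, `similarity` and `suggest_pairs` are hypothetical, the extension is dropped before tokenizing, tokens are lowercased so that case differences do not count, repeated words count once, and only ASCII letters and digits are treated as alphanumeric.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Split a filename into a set of lowercase alphanumeric tokens.
# Assumption: the extension is not part of the name and is dropped first.
sub name_tokens {
    my ($filename) = @_;
    ( my $stem = $filename ) =~ s/\.[^.]+$//;
    my %tokens = map { lc($_) => 1 }
                 grep { length } split /[^A-Za-z0-9]+/, $stem;
    return \%tokens;
}

# Score = shared tokens divided by the token count of the longer name.
sub similarity {
    my ( $first, $second ) = @_;
    my ( $t1, $t2 ) = ( name_tokens($first), name_tokens($second) );
    my ( $n1, $n2 ) = ( scalar keys %$t1, scalar keys %$t2 );
    my $max = $n1 > $n2 ? $n1 : $n2;
    return 0 unless $max;
    my $shared = grep { exists $t2->{$_} } keys %$t1;
    return $shared / $max;
}

# Group files by exact size, score every pair inside each group,
# and return [score, file1, file2] entries sorted by descending score.
sub suggest_pairs {
    my (@files) = @_;
    my %by_size;
    for my $file (@files) {
        my $size = ( stat $file )[7];
        next unless defined $size;    # skip files that cannot be read
        push @{ $by_size{$size} }, $file;
    }
    my @pairs;
    for my $group ( values %by_size ) {
        for my $i ( 0 .. $#{$group} - 1 ) {
            for my $j ( $i + 1 .. $#{$group} ) {
                my $score = similarity( basename( $group->[$i] ),
                                        basename( $group->[$j] ) );
                push @pairs, [ $score, $group->[$i], $group->[$j] ];
            }
        }
    }
    my @sorted = sort { $b->[0] <=> $a->[0] } @pairs;
    return @sorted;
}
```

Because the tokens are stored as a set, word order and punctuation do not affect the score, which is the property a regex filter lacks. The pairwise loop is quadratic within each size group, so it is only practical when few files share a size.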


赢得她心 2025-01-08 01:39:05

I already have Fast Duplicate File Finder Pro, which outputs a text report of the duplicates in CSV and XML format.

I will process the CSV to see the groupings and rename the files so that I can get Beyond Compare to do a full binary comparison on them.

Update:

Here is my code. This Perl script looks at each group of files that the report lists as identical (in the directories/folders being compared) and renames the others to match the one being kept. The two folders can then be run through Beyond Compare, which will do a full binary compare (if the flatten folders option is switched on). The binary compare confirms the match, which means that one of each duplicate pair can be purged.

#!/usr/bin/perl

use strict;
use warnings;

use File::Basename;

# fixed
# put matching string - i.e. some or all of path of file to keep here e.g. C:\files\keep\ or just keep
my $subpathOfFileToKeep = "keep";
# CSV report exported from Fast Duplicate File Finder
my $csvFile = "fast_duplicate_filefinder_export_as_csv.csv";

# changes: state for the duplicate group currently being read
my $currentGroup;
my $filenameToKeep     = "";
my @filesToRenameArray = ();

open( my $fdffCsv, '<', $csvFile ) or die "Cannot open $csvFile: $!";

while ( my $line = <$fdffCsv> )
{
  my @lineColumns = split( /,/, $line );

  # data rows start with a numeric group index; skip headers and blank lines
  next unless defined $lineColumns[0] && $lineColumns[0] =~ m/^\s*(\d+)\s*$/;
  my $group = $1;

  # the file path is the first double-quoted field on the line
  next unless $line =~ /"([^"]+)"/;
  my $filename = $1;

  # a new group index means the previous group is complete, so rename it
  if ( defined $currentGroup && $group != $currentGroup )
  {
    match_the_filenames();
  }
  $currentGroup = $group;

  store_keep_and_rename( $filename );
}

close( $fdffCsv );

# rename the files of the last group
match_the_filenames();

# remember the name of the file to keep, or queue the file for renaming
sub store_keep_and_rename
{
  my ( $filename ) = @_;
  my ( $name, $path ) = fileparse( $filename );

  # \Q...\E makes backslashes and dots in the subpath match literally
  if ( $path =~ /\Q$subpathOfFileToKeep\E/ )
  {
    $filenameToKeep = $name;
  }
  else
  {
    push( @filesToRenameArray, $filename );
  }
}

# give every queued file of the current group the name of the kept file
sub match_the_filenames
{
  # without a kept file in this group there is no name to rename to
  if ( $filenameToKeep ne "" )
  {
    foreach my $PreRename ( @filesToRenameArray )
    {
      my ( undef, $prePath ) = fileparse( $PreRename );
      my $PostRename = $prePath.$filenameToKeep;

      next if $PreRename eq $PostRename;

      # never overwrite a file that already has the target name
      if ( -e $PostRename )
      {
        warn "Skipping ".$PreRename.": ".$PostRename." already exists\n";
        next;
      }

      print STDOUT "Filename was: ".$PreRename."\n";
      print STDOUT "Filename will be: ".$PostRename."\n\n";

      rename( $PreRename, $PostRename )
        or warn "Could not rename ".$PreRename.": $!\n";
    }
  }

  # reset the state so one group's kept name is never applied to the next
  @filesToRenameArray = ();
  $filenameToKeep     = "";
}
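
As an optional extra check that is not part of the answer above, a byte-for-byte comparison can also be done from Perl with the core `File::Compare` module, for example before purging a duplicate. This is only a sketch: the sub name `files_identical` is hypothetical, and it is meant to complement the Beyond Compare run rather than replace it.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Returns 1 if the two files have identical contents, 0 if they differ.
# Dies if either file cannot be read (compare() returns -1 in that case).
sub files_identical {
    my ( $first, $second ) = @_;
    my $result = compare( $first, $second );
    die "Could not compare $first and $second: $!\n" if $result < 0;
    return $result == 0 ? 1 : 0;
}
```

A call such as `files_identical( $PreRename, $keptFile )` would need the full path of the kept file, which the script above does not store, so using it there would require keeping that path alongside `$filenameToKeep`.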
