How do I sort files into directories based on their file names?

Posted on 2024-07-13 21:12:25

I have a huge number of files to sort, all named using some terrible convention.
Here are some examples:

(4)_mr__mcloughlin____.txt
12__sir_john_farr____.txt
(b)mr__chope____.txt
dame_elaine_kellett-bowman____.txt
dr__blackburn______.txt

Each of these names is supposed to be a different person (speaker). Someone in another IT department produced these files from a ton of XML files using some script, but as you can see, the naming is unfathomably stupid.

I need to sort literally tens of thousands of these files, with multiple text files for each person, each with something stupid making the filename different, be it more underscores or some random number. They need to be sorted by speaker.

This would be easier with a script to do most of the work; then I could just go back and merge folders that should be under the same name, or whatever.

There are a number of ways I was thinking about doing this:

- Parse the names from each file and sort them into folders for each unique name.
- Get a list of all the unique names from the filenames, then have the script look through this simplified list for similar names and ask me whether they are the same; once that has been determined, it sorts them all accordingly.

I plan on using Perl, but I can try a new language if it's worth it. I'm not sure how to go about reading each filename in a directory, one at a time, into a string for parsing into an actual name. I'm not completely sure how to parse with regex in Perl either, but that might be googleable.

For the sorting, I was just going to use the shell command:

`cp filename.txt /example/destination/filename.txt`

but only because that's all I know, so it's easiest.

I don't even have a pseudocode idea of what I'm going to do, so if someone knows the best sequence of actions, I'm all ears. I guess I am looking for a lot of help, and I am open to any suggestions. Many, many thanks to anyone who can help.

故事和酒 2024-07-20 21:12:25

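I hope I've understood your question correctly; it's a bit ambiguous, IMHO. This code is untested, but it should do what I think you want.

use strict;
use warnings;
use File::Copy;

# Reduce a file name to a lowercase speaker name, e.g.
# "(4)_mr__mcloughlin____.txt" becomes "mcloughlin".
sub sanitize {
    local $_ = lc shift;
    s/\.txt$//;                     # drop the extension (the dot must be escaped)
    s/\(\w+\)|\d+//g;               # drop "(4)", "(b)" and stray numbers
    s/_+/ /g;                       # "_" is a word character, so do this before using \b
    s/\b(?:dame|dr|mr|sir)\b//g;    # drop titles
    s/\s+/ /g;                      # collapse runs of whitespace
    s/^ | $//g;                     # trim both ends
    return $_;
}

# Copy each file into a directory named after its sanitized speaker name.
sub sort_files_to_dirs {
    my @files = @_;
    for my $filename (@files) {
        my $dirname = sanitize($filename);
        next if $dirname eq '';     # nothing usable left in the name
        unless ( -d $dirname ) {
            mkdir $dirname or die "Cannot create '$dirname': $!";
        }
        copy( $filename, "$dirname/$filename" )
            or warn "Could not copy '$filename': $!\n";
    }
}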

水中月 2024-07-20 21:12:25

Are all the current files in the same directory? If so, you could use `opendir` and `readdir` to read through all the files one by one. Build a hash using the cleaned-up file name as the key (replace each run of '_' with a space, and drop leading numbers and anything inside brackets), so that you get something like this:

(4)_mr__mcloughlin____.txt -> 'mr mcloughlin'
12__sir_john_farr____.txt -> 'sir john farr'
(b)mr__chope____.txt -> 'mr chope'
dame_elaine_kellett-bowman____.txt -> 'dame elaine kellett-bowman'
dr__blackburn______.txt -> 'dr blackburn'

Set the value of each hash entry to the number of times that name has occurred so far. After these entries, you should have a hash that looks like this:

'mr mcloughlin' => 1
'sir john farr' => 1
'mr chope' => 1
'dame elaine kellett-bowman' => 1
'dr blackburn' => 1

Whenever you come across a new entry in your hash, simply create a new directory using the key name. Then all you have to do is copy the file, under its changed name (with the corresponding hash value as a suffix), into that directory. For example, if you were to stumble upon another entry that reads 'mr mcloughlin', you could copy it as

./mr mcloughlin/mr mcloughlin_2.txt
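
The bookkeeping described above could be sketched in Perl roughly as follows. This is only a sketch under assumptions: `normalize_name` and `plan_copies` are hypothetical helper names, the cleanup rules are guessed from the five sample file names, and nothing is actually copied; the script just prints where each file would go.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a file name into a hash key such as
# 'mr mcloughlin'. The rules are guessed from the sample names only.
sub normalize_name {
    my ($filename) = @_;
    my $name = lc $filename;
    $name =~ s/\.txt$//;          # drop the extension
    $name =~ s/\([^)]*\)//g;      # drop anything in brackets, e.g. "(4)"
    $name =~ s/^\d+//;            # drop a leading number, e.g. "12"
    $name =~ s/_+/ /g;            # each run of underscores becomes one space
    $name =~ s/^\s+|\s+$//g;      # trim both ends
    return $name;
}

# Hypothetical helper: return [source, directory, new file name] for
# every file, numbering the files of each speaker from 1 upwards.
sub plan_copies {
    my @filenames = @_;
    my %seen;                     # name => number of files seen so far
    my @plan;
    for my $filename (@filenames) {
        my $name  = normalize_name($filename);
        my $count = ++$seen{$name};
        push @plan, [ $filename, $name, "${name}_$count.txt" ];
    }
    return @plan;
}

# Read the current directory with opendir/readdir and print the plan.
opendir my $dh, '.' or die "Cannot open current directory: $!";
my @txt_files = sort grep { /\.txt$/ && -f $_ } readdir $dh;
closedir $dh;

for my $step ( plan_copies(@txt_files) ) {
    my ( $source, $dir, $new_name ) = @$step;
    print "$source -> ./$dir/$new_name\n";
}
```

Once the printed plan looks right, the `print` line could be swapped for a `mkdir` of `$dir` (when it does not exist yet) followed by `copy` from File::Copy.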

脸赞 2024-07-20 21:12:25

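I would:

1. Define what is significant in a name:
   - Is dr__blackburn different from dr_blackburn?
   - Is dr__blackburn different from mr__blackburn?
   - Are the leading numbers meaningful?
   - Are the leading/trailing underscores meaningful?
   - etc.
2. Come up with rules and an algorithm to convert a name to a directory (Leon's answer is a very good start).
3. Read in the names and process them one at a time:
   - I would use some combination of opendir and recursion.
   - I would copy the files as you process them; again, Leon's post is a good example.
4. If this script will need to be maintained and used in the future, I would definitely create tests (e.g. using http://search.cpan.org/dist/Test-More/) for each regex path; when you find a new wrinkle, add a new test and make sure it fails, then fix the regex, then re-run the tests to make sure nothing else broke.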
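
As a rough illustration of the testing suggestion, a Test::More file might look like the sketch below. `name_to_dir` is a hypothetical stand-in for whatever name-cleaning routine you settle on, and the expected directory names encode assumed answers to the questions above (titles, numbers and brackets are treated as insignificant), not anything the question specifies.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical routine under test; swap in your real one.
sub name_to_dir {
    my ($filename) = @_;
    my $name = lc $filename;
    $name =~ s/\.txt$//;                    # extension
    $name =~ s/\([^)]*\)|\d+//g;            # "(4)", "(b)", stray numbers
    $name =~ s/_+/ /g;                      # underscores to spaces
    $name =~ s/\b(?:dame|dr|mr|sir)\b//g;   # titles (assumed insignificant)
    $name =~ s/\s+/ /g;                     # collapse whitespace
    $name =~ s/^ | $//g;                    # trim
    return $name;
}

# One case per regex path; add a failing case first whenever a new
# naming quirk turns up, then fix the regex until it passes.
my %expected = (
    '(4)_mr__mcloughlin____.txt'         => 'mcloughlin',
    '12__sir_john_farr____.txt'          => 'john farr',
    '(b)mr__chope____.txt'               => 'chope',
    'dame_elaine_kellett-bowman____.txt' => 'elaine kellett-bowman',
    'dr__blackburn__.txt'                => 'blackburn',
);

for my $filename ( sort keys %expected ) {
    is( name_to_dir($filename), $expected{$filename}, $filename );
}

done_testing();
```

Saved as, say, `names.t` (a made-up file name), it can be run with `prove names.t` or plain `perl names.t`.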

ぃ弥猫深巷。 2024-07-20 21:12:25

I've not used Perl in a while so I'm going to write this in Ruby. I will comment it to establish some pseudocode.

require 'fileutils'

DESTINATION = '/some/faraway/place/must/exist/and/ideally/be/empty'

# get a list of all .txt files in the current directory
Dir["*.txt"].each do |filename|
  # strategy:
  # - chop off the extension
  # - switch to all lowercase
  # - get rid of everything but spaces, dashes, letters, underscores
  # - then swap any run of spaces, dashes, and underscores for a single space
  # - then strip whitespace off front and back
  name = File.basename(filename, '.txt').downcase.
         gsub(/[^a-z_\s-]+/, '').gsub(/[_\s-]+/, ' ').strip
  target_folder = File.join(DESTINATION, name)

  # make sure we don't overwrite a file
  if File.exist?(target_folder) && !File.directory?(target_folder)
    raise "Destination folder is a file"
  # if the directory doesn't exist then create it
  elsif !File.exist?(target_folder)
    Dir.mkdir(target_folder)
  end
  # now copy the file into the folder, keeping its original name
  FileUtils.cp(filename, target_folder)
end

That's the idea, anyway - I've made sure all the API calls are correct, but this isn't tested code. Does this look like what you're trying to accomplish? Might this help you write the code in Perl?

浅忆流年 2024-07-20 21:12:25

You can split the filenames using something like

my @tokens = split /_+/, $filename;

The last entry of @tokens should be ".txt" for all of these filenames, and the second-to-last should be the last name, which will be similar across files for the same person even where the name has been misspelled in places (or where "Dr. Jones" has been changed to "Brian Jones", for instance). You may want to use some sort of edit distance as a similarity metric to compare $tokens[-2] across the various filenames; when two entries have similar enough last names, the script should prompt you with them as candidates for merging.
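
One way this might look in Perl is sketched below. It is an illustration rather than a finished tool: `edit_distance` is a hand-written Levenshtein distance (chosen here so the sketch needs no CPAN modules), `last_name` and `merge_candidates` are hypothetical helper names, the threshold of 2 edits is an arbitrary assumption, and the second file name in the demo is made up.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance between two strings.
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @curr = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,              # deletion
                $curr[ $j - 1 ] + 1,        # insertion
                $prev[ $j - 1 ] + $cost,    # substitution
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: the last name is the token just before ".txt".
# Assumes at least one underscore precedes the extension.
sub last_name {
    my ($filename) = @_;
    my @tokens = split /_+/, lc $filename;
    return @tokens >= 2 ? $tokens[-2] : '';
}

# Hypothetical helper: pairs of distinct last names within $max edits.
sub merge_candidates {
    my ( $max, @filenames ) = @_;
    my %unique;
    $unique{ last_name($_) } = 1 for @filenames;
    my @names = sort grep { length } keys %unique;
    my @pairs;
    for my $i ( 0 .. $#names ) {
        for my $j ( $i + 1 .. $#names ) {
            push @pairs, [ $names[$i], $names[$j] ]
                if edit_distance( $names[$i], $names[$j] ) <= $max;
        }
    }
    return @pairs;
}

my @files = ( 'dr__blackburn______.txt', 'mr_blackburne__.txt', '12__sir_john_farr____.txt' );
print "Same speaker? $_->[0] / $_->[1]\n" for merge_candidates( 2, @files );
```

Comparing every pair is quadratic in the number of unique last names, which should be fine for a few thousand speakers but is worth keeping in mind.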

深海夜未眠 2024-07-20 21:12:25

As you are asking a very general question, any language could do this as long as we have a better codification of rules. We don't even have the specifics, only a "sample".

So, working blind, it looks like human monitoring will be needed. The idea, then, is a sieve: something you can run, check, and run again, over and over, until everything has been sorted down to a few small manual tasks.

The code below makes a lot of assumptions, because you pretty much left it to us to handle it. One of them is that the sample is a list of all the possible last names; if there are any other last names, add them and run it again.

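use strict;
use warnings;
use File::Copy;
use File::Find::Rule;
use File::Spec;
use Readonly;

Readonly my $SOURCE_ROOT    => '/mess/they/left';
Readonly my $DEST_DIRECTORY => '/where/i/want/all/this';

my @lname_list = qw<mcloughlin farr chope kellett-bowman blackburn>;
# Try names containing non-letters first, then longer names, so that a
# more specific alternative is tried before a shorter one.
my $lname_regex
    = join( '|'
          , sort {  ( $b =~ /\P{Alpha}/ ) <=> ( $a =~ /\P{Alpha}/ )
                 || ( length $b ) <=> ( length $a )
                 || $a cmp $b
                 } @lname_list
          )
    ;
my %dest_dir_for;

# Return the directory for a last name, creating it on first use.
sub get_dest_directory {
    my $case = shift;
    my $dest_dir = $dest_dir_for{$case};
    return $dest_dir if $dest_dir;

    $dest_dir = $dest_dir_for{$case}
        = File::Spec->catdir( $DEST_DIRECTORY, $case )
        ;
    unless ( -e $dest_dir ) {
        mkdir $dest_dir or die "Cannot create $dest_dir: $!";
    }
    return $dest_dir;
}

foreach my $file_path (
    File::Find::Rule->file
        ->name( '*.txt' )->in( $SOURCE_ROOT )
) {
    my $file_name =  [ File::Spec->splitpath( $file_path ) ]->[2];
    # Collapse every run of junk characters into a single underscore.
    $file_name    =~ s/[^\p{Alpha}.-]+/_/g;
    $file_name    =~ s/^_//;
    $file_name    =~ s/_[.]/./;

    # (?:...) keeps the anchor out of the capture, so $case is the last name.
    my ( $case )  =  $file_name =~ m/(?:^|_)($lname_regex)[._]/i;

    next unless $case;
    # as we next-ed, we're dealing with only the cases we want here.

    my $dest_path = File::Spec->catfile( get_dest_directory( lc $case )
                                       , $file_name
                                       );
    # Cleaned names can collide; leave such files for the next pass.
    if ( -e $dest_path ) {
        warn "Skipping $file_path: $dest_path already exists\n";
        next;
    }
    move( $file_path, $dest_path )
        or warn "Could not move $file_path: $!\n";
}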
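
To help with the "check" half of the sieve, something like the following sketch could list the file names that match none of the known last names, so you can see what to add before the next run. `build_lname_regex` and `unmatched_files` are hypothetical helpers, not part of the script above; the pattern follows the same `(?:^|_)(name)[._]` shape but sorts by length only and uses `quotemeta`, which is a simplification. The second file name in the demo is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation from the known last names,
# longest first so that a longer name is tried before a shorter one.
sub build_lname_regex {
    my @lnames = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } @lnames;
    return qr/(?:^|_)($alternation)[._]/i;
}

# Hypothetical helper: file names that match none of the known last
# names, i.e. what the sieve would leave behind for the next pass.
sub unmatched_files {
    my ( $lnames, @filenames ) = @_;
    my $regex = build_lname_regex(@$lnames);
    return grep {
        ( my $clean = $_ ) =~ s/[^\p{Alpha}.-]+/_/g;   # same cleanup rule
        $clean !~ $regex;
    } @filenames;
}

my @known = qw(mcloughlin farr chope kellett-bowman blackburn);
my @names = ( '(4)_mr__mcloughlin____.txt', '7_mrs_thatcher__.txt' );
print "No known last name in: $_\n" for unmatched_files( \@known, @names );
```

Each name it prints is either a new last name to append to the list or a file to handle by hand.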
