How do I sort files into directories based on their filenames?
I have a huge number of files to sort all named in some terrible convention.
Here are some examples:
(4)_mr__mcloughlin____.txt
12__sir_john_farr____.txt
(b)mr__chope____.txt
dame_elaine_kellett-bowman____.txt
dr__blackburn______.txt
These names are supposed to be a different person (speaker) each. Someone in another IT department produced these from a ton of XML files using some script but the naming is unfathomably stupid as you can see.
I need to sort literally tens of thousands of these files with multiple files of text for each person; each with something stupid making the filename different, be it more underscores or some random number. They need to be sorted by speaker.
This would be easier with a script to do most of the work then I could just go back and merge folders that should be under the same name or whatever.
There are a number of ways I was thinking about doing this.
- parse the names from each file and sort them into folders for each unique name.
- get a list of all the unique names from the filenames, then look through this simplified list of unique names for similar ones and ask me whether they are the same, and once it has determined this it will sort them all accordingly.
I plan on using Perl, but I can try a new language if it's worth it. I'm not sure how to go about reading in each filename in a directory one at a time into a string for parsing into an actual name. I'm not completely sure how to parse with regex in perl either, but that might be googleable.
For the sorting, I was just gonna use the shell command:
`cp filename.txt /example/destination/filename.txt`
but just cause that's all I know so it's easiest.
I don't even have a pseudocode idea of what I'm going to do either, so if someone knows the best sequence of actions, I'm all ears. I guess I am looking for a lot of help, and I am open to any suggestions. Many many many thanks to anyone who can help.
B.
Comments (6)
I hope I understand your question right, it's a bit ambiguous IMHO. This code is untested, but should do what I think you want.
Are all the current files in the same directory? If so, you could use `opendir` and `readdir` to read through the files one by one. Build a hash using the file name as the key (remove all `_` as well as any information inside the brackets), so that every variant of a speaker's filename reduces to the same key.
Set the value of the hash to the number of instances of that name seen so far.
Whenever you come across a new entry in your hash, simply create a new directory using the key name. Now all you have to do is copy the file under the changed name (using the corresponding hash value as a suffix) into the new directory. So, for example, if you were to stumble upon another entry that reads as 'mr mcloughlin', you would copy it into the existing 'mr mcloughlin' directory with the current count appended to its name.
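The steps above can be sketched as follows. This is a minimal, untested-on-real-data sketch in Python rather than Perl (`Path.iterdir` plays the role of `opendir`/`readdir`, and a `Counter` plays the role of the hash). The function names `speaker_key` and `sort_into_dirs` are hypothetical, and stripping a leading number such as `12` is an extra assumption not stated in the answer.

```python
import re
import shutil
from collections import Counter
from pathlib import Path


def speaker_key(filename: str) -> str:
    """Reduce a filename to a speaker key such as 'mr mcloughlin'."""
    stem = Path(filename).stem                 # drop the ".txt" extension
    stem = re.sub(r"\([^)]*\)", " ", stem)     # drop bracketed info such as "(4)"
    stem = re.sub(r"^\s*\d+", " ", stem)       # assumption: drop a leading number
    return " ".join(stem.replace("_", " ").split())


def sort_into_dirs(source: Path, dest: Path) -> Counter:
    """Copy every .txt file in source into dest/<speaker key>/."""
    seen = Counter()                           # key -> number of files seen so far
    for path in sorted(source.iterdir()):
        if not path.is_file() or path.suffix != ".txt":
            continue
        key = speaker_key(path.name)
        if not key:
            continue                           # nothing usable left; review by hand
        seen[key] += 1
        target_dir = dest / key
        target_dir.mkdir(parents=True, exist_ok=True)
        # The running count is the suffix, so copies never overwrite each other.
        shutil.copy2(path, target_dir / f"{key}_{seen[key]}.txt")
    return seen
```

Because the files are copied rather than moved, the original directory stays untouched and the script can be re-run with adjusted rules.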
I would:
- define what's significant in the name:
  - is `dr__blackburn` different than `dr_blackburn`?
  - is `dr__blackburn` different than `mr__blackburn`?
- come up with rules and an algorithm to convert a name to a directory (Leon's is a very good start)
- read in the names and process them one at a time

If this script will need to be maintained and used in the future, I would definitely create tests (e.g. using http://search.cpan.org/dist/Test-More/) for each regexp path. When you find a new wrinkle, add a new test and make sure it fails, then fix the regex, then re-run the tests to make sure nothing broke.
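As an illustration of that test-first loop, here is a small table-driven sketch. It is written in Python with plain comparisons rather than Perl's Test::More, and both the rule set in `name_to_directory` and the expected directory names are assumptions made for the example, not rules agreed anywhere in this thread.

```python
import re


def name_to_directory(filename: str) -> str:
    """Hypothetical rule set: strip the extension, a leading "(...)" group
    or number, and collapse runs of underscores."""
    stem = re.sub(r"\.txt$", "", filename)
    stem = re.sub(r"^(\([^)]*\)|\d+)", "", stem)
    return re.sub(r"_+", "_", stem).strip("_")


# One case per regex path. When a new wrinkle turns up, add it here first,
# watch it fail, then fix name_to_directory and re-run.
CASES = {
    "(4)_mr__mcloughlin____.txt": "mr_mcloughlin",
    "12__sir_john_farr____.txt": "sir_john_farr",
    "(b)mr__chope____.txt": "mr_chope",
    "dame_elaine_kellett-bowman____.txt": "dame_elaine_kellett-bowman",
    "dr__blackburn______.txt": "dr_blackburn",
}


def failing_cases(cases=CASES):
    """Return (filename, got, wanted) for every case that does not match."""
    results = ((f, name_to_directory(f), want) for f, want in cases.items())
    return [(f, got, want) for f, got, want in results if got != want]


if __name__ == "__main__":
    for filename, got, want in failing_cases():
        print(f"FAIL {filename!r}: got {got!r}, wanted {want!r}")
```

Under these rules `dr__blackburn` and `dr_blackburn` map to the same directory, while `dr__blackburn` and `mr__blackburn` stay separate; changing that decision means changing one rule and its cases.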
I've not used Perl in a while so I'm going to write this in Ruby. I will comment it to establish some pseudocode.
That's the idea, anyway - I've made sure all the API calls are correct, but this isn't tested code. Does this look like what you're trying to accomplish? Might this help you write the code in Perl?
You can split each filename into tokens, held in an array such as `@tokens`.
The last entry of `@tokens` should be `".txt"` for all of these filenames, but the second-to-last should be similar for the same person even where the name has been misspelled in places (or "Dr. Jones" changed to "Brian Jones", for instance). You may want to use some sort of edit distance as a similarity metric to compare `@tokens[-2]` across filenames; when two entries have similar enough last names, the script should prompt you with them as candidates for merging.
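A rough sketch of that idea, in Python rather than Perl: filenames are split on runs of underscores, the second-to-last token is treated as the last name, and pairs of last names within a small Levenshtein distance are reported. The function names and the `max_distance=2` threshold are assumptions chosen for illustration, and the code relies on every filename having underscores before `.txt`, as the samples do.

```python
import re
from itertools import combinations


def tokens(filename: str) -> list[str]:
    """Split on runs of underscores; ".txt" ends up as the last token
    when the name has trailing underscores, as in the samples."""
    return [t for t in re.split(r"_+", filename) if t]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def merge_candidates(filenames, max_distance=2):
    """Return pairs of distinct last names within max_distance edits."""
    last_names = sorted(
        {tokens(f)[-2] for f in filenames if len(tokens(f)) >= 2}
    )
    return [(a, b) for a, b in combinations(last_names, 2)
            if edit_distance(a, b) <= max_distance]
```

Comparing unique last names rather than raw filenames keeps the pairwise step manageable even with tens of thousands of files, since each pair returned is a question for a human, not an automatic merge.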
As you are asking a very general question, any language could do this as long as we have a better codification of rules. We don't even have the specifics, only a "sample".
So, working blind, it looks like human monitoring will be needed. So the idea is a sieve. Something you can repeatedly run and check and run again and check again and again until you've got everything sorted to a few small manual tasks.
The code below makes a lot of assumptions, because you pretty much left it to us to handle it. One of which is that the sample is a list of all the possible last names; if there are any other last names, add 'em and run it again.
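To make the sieve idea concrete, here is a minimal sketch in Python; it is an illustration under stated assumptions, not the answerer's code. `KNOWN_LAST_NAMES` is taken from the sample filenames in the question, and the function name `sieve` is hypothetical. Each pass sorts whatever matches a known last name and hands back the rest for a human to inspect.

```python
import re

# Assumption: the last names visible in the question's sample.
# Add any others you discover and run the sieve again.
KNOWN_LAST_NAMES = ["mcloughlin", "farr", "chope", "kellett-bowman", "blackburn"]


def sieve(filenames, last_names=KNOWN_LAST_NAMES):
    """Return ({last name: [files]}, [files that matched nothing])."""
    sorted_files = {name: [] for name in last_names}
    leftovers = []
    for filename in filenames:
        # Compare whole underscore-separated words, not substrings.
        words = re.split(r"_+", filename.lower().removesuffix(".txt"))
        match = next((name for name in last_names if name in words), None)
        if match is None:
            leftovers.append(filename)   # needs a human, or a new last name
        else:
            sorted_files[match].append(filename)
    return sorted_files, leftovers
```

Each run should shrink the leftover list; when it stops shrinking, what remains is the small manual task.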