Using AWK to relabel the last lines in a group of text
I have this output from running various commands:
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
d41d8cd98f00b204e9800998ecf8427e 1317506438 /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
2430ffcf28e7ef6990e46ae081f1fb08 1317522636 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
2430ffcf28e7ef6990e46ae081f1fb08 1317506569 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
I want to pipe it through awk to make it look like this
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
Any ideas?
Some clarifications:
The last file before the newline or EOF will always be the original file; everything before it should be marked as a duplicate.
The first column is the md5sum of the file and the second is the modification date. You will notice that the last file in a group always has the oldest timestamp; this is the criterion I am using to determine which file is the "original": the oldest one.
Here is the command I'm using to get the list of all duplicates:
find ${PWD} -type f -exec stat -c %Y {} \; -exec md5sum '{}' \; | sed -r 'N;s/([0-9]+)\n([^ ]+) /\2 \1/g' | sort -r | uniq -w 32 --all-repeated=separate
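For readability, here is the same pipeline split over several lines, with a note on what each stage does (the command itself is unchanged):

# stat prints the mtime on one line, md5sum prints "hash  path" on the next;
# sed joins each pair of lines into "hash mtime path";
# sort -r sorts in reverse, so within one hash group the oldest mtime ends up last;
# uniq -w 32 compares only the 32-character hash and prints only the duplicated
# groups, separated by blank lines (--all-repeated=separate).
find ${PWD} -type f -exec stat -c %Y {} \; -exec md5sum '{}' \; |
  sed -r 'N;s/([0-9]+)\n([^ ]+) /\2 \1/g' |
  sort -r |
  uniq -w 32 --all-repeated=separate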
Sort the lines (using sort), store the hash in a temporary variable and compare it with the current line using an if statement. Another if statement should get rid of possible blank lines. For example:

| sort | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } }'

Edit:

Since you provided those clarifications, you could do it this way:

| tac | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } else { print "" } }' | tac

tac inverts the line order, achieving exactly what sort did in the first example. The second tac restores the original order.
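Written out over several lines, the second version might look like this (the same awk program, only reformatted with comments; the leading | stands for the output of the pipeline in the question):

| tac |
awk '{
  if ($0) {                                      # non-blank data line
    if (TEMP != $1) { print "Original: " $0 }    # hash changed: first line of the group in reversed order, i.e. the original
    else            { print "Duplicate: " $0 }
    TEMP = $1
  } else {
    print ""                                     # keep the blank separators between groups
  }
}' |
tac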
This sed one-liner might work:
By appending a newline to the source file, the problem becomes two substitutions, negating any EOF inelegance.
I guess a sed solution is acceptable as you used sed in the source file prep.
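As an illustration of that idea (append a trailing blank line, then handle everything with two substitutions over a two-line window), a GNU sed version might look like the following sketch. This is a reconstruction rather than the answerer's exact command, and duplicates.txt is just a stand-in for the output of the pipeline in the question:

# $G appends one blank line at EOF, so every group, including the last,
# ends with a blank line; the second sed then labels each line.
sed '$G' duplicates.txt |
sed -r '
  # look ahead one line
  $!N
  # if the next line is blank, this line is the last of its group: label it Original
  /\n$/ s/^[^[:space:]]+ [^[:space:]]+ /Original: /
  t done
  # otherwise it is a Duplicate
  s/^[^[:space:]]+ [^[:space:]]+ /Duplicate: /
  :done
  # print the first line of the two-line window, drop it, and reprocess the rest
  P
  D
'

The appended blank line also shows up as a single trailing blank line in the output.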
How do you know what's a duplicate and what's a copy? That would be my question.
It would be easy if the duplicates all had Copy in the name, but in your first example one of the duplicates is called New Text Document.txt, and the original is in the .svn directory, which should never have been looked at.

It looks like you have the MD5 hash in the first column, which means you could sort on that, and then use awk to loop through your output and print a blank line whenever the hash changes. That would group your files together.

The original vs. copy question is going to be much more difficult. You'll have to work out a good criterion for that. You might choose the earliest modification date (mdate). You could sort on that too. When you break on the hash, you could simply assume the first file in the list (because it has the earliest date) to be the original.

Or, you could simply assume that the ones with the word Copy embedded in the file name are the copies. And then it might not really matter all that much: do you want the program to merely identify duplicates, or delete them? If the program is merely identifying duplicates, there's no need to figure out which ones are the original and which ones are the duplicates. You can probably do that better with your eye than any algorithm.

By the way, what exactly are the three columns? I'm assuming the first is a hash and the last is the file name, but what is the middle one?
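A minimal sketch of that grouping step, assuming the hash is the first whitespace-separated field of each line (listing.txt is a stand-in for the hash/mtime/path listing):

sort listing.txt |
awk '
  prev != "" && $1 != prev { print "" }   # the hash changed: separate the groups with a blank line
  { print; prev = $1 }
'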
Maybe this will work, if blank lines appear after the last line of each group, including the very last group, and if the file names never contain blanks. It hinges on the presence of the blank lines. If the last blank line is missing, the last line will not be printed.

This doesn't work because of the blanks in the file names (so most lines do not have just 3 fields). Awk is not really the most appropriate tool. I tend to use Perl when Awk is not suitable:
This produces:
If you must use Awk, then you'll need to work on $0 when NF >= 3, removing the hash and inode number (or whatever the second value on the data line is) to find the filename.
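A sketch of that Awk approach, under the assumptions stated earlier in this thread (blank lines separate the groups and the last data line in each group is the original); duplicates.txt is just a stand-in for the output of the pipeline in the question:

awk '
  NF >= 3 {                                   # a data line: hash, mtime, then the file name
    line = $0
    sub(/^[^ ]+ +[^ ]+ +/, "", line)          # strip the first two columns, keep the full path, spaces and all
    if (pending != "") print "Duplicate: " pending   # another line followed, so the buffered one was a duplicate
    pending = line
    next
  }
  {                                           # blank separator: the buffered line was the last of its group
    if (pending != "") print "Original: " pending
    pending = ""
    print ""
  }
  END { if (pending != "") print "Original: " pending }   # handle a final group with no trailing blank line
' duplicates.txt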