Finding duplicates with md5sum
I have a double loop that opens a file and uses awk to take the first and second fields of each line. The first field is the md5sum of a file and the second is the filename. However, when I run the script to check for duplicate files, file1 matches file1 itself, so the script thinks they are duplicates even though they are the same file. Here is my code:
echo start
for i in $(<dump.txt) ; do
    md=$(echo $i|awk -F'|' '{print $1}')
    file=$(echo $i|awk -F'|' '{print $2}')
    for j in $(<dump.txt) ; do
        m=$(echo $j|awk -F'|' '{print $1}')
        f=$(echo $j|awk -F'|' '{print $2}')
        if [ "$md" == "$m" ]; then
            echo $file and $f are duplicates
        fi
    done
done
echo end
The dump file looks like this:
404460c24654e3d64024851dd0562ff1 *./extest.sh
7a900fdfa67739adcb1b764e240be05f *./test.txt
7a900fdfa67739adcb1b764e240be05f *./test2.txt
88f5a6b83182ce5c34c4cf3b17f21af2 *./dump.txt
c8709e009da4cce3ee2675f2a1ae9d4f *./test3.txt
d41d8cd98f00b204e9800998ecf8427e *./checksums.txt
The entire code is:
#!/bin/sh
func ()
{
    if [ "$1" == "" ]; then
        echo "Default";
        for i in `find` ;
        do
            #if [ -d $i ]; then
            #echo $i "is a directory";
            #fi
            if [ -f $i ]; then
                if [ "$i" != "./ex.sh" ]; then
                    #echo $i "is a file";
                    md5sum $i >> checksums.txt;
                    sort --output=dump.txt checksums.txt;
                fi
            fi
        done
    fi
    if [ "$1" == "--long" ]; then
        echo "--long";
        for i in `find` ;
        do
            #if [ -d $i ]; then
            #echo $i "is a directory";
            #fi
            if [ -f $i ]; then
                echo $i "is a file";
            fi
        done
    fi
    if [ "$1" == "--rm" ]; then
        echo "--rm";
        for i in `find` ;
        do
            #if [ -d $i ]; then
            #echo $i "is a directory";
            #fi
            if [ -f $i ]; then
                echo $i "is a file";
            fi
        done
    fi
}
parse () {
    echo start
    for i in $(<dump.txt) ; do
        md=$(echo $i|awk -F'|' '{print $1}')
        file=$(echo $i|awk -F'|' '{print $2}')
        for j in $(<dump.txt) ; do
            m=$(echo $j|awk -F'|' '{print $1}')
            f=$(echo $j|awk -F'|' '{print $2}')
            #echo $md
            #echo $m
            if [ "$file" != "$f" ] && [ "$md" == "$m" ]; then
                echo Files $file and $f are duplicates.
            fi
        done
    done
    echo end
}
getArgs () {
    if [ "$1" == "--long" ]; then
        echo "got the first param $1";
    else
        if [ "$1" == "--rm" ]; then
            echo "got the second param $1";
        else
            if [ "$1" == "" ]; then
                echo "got default param";
            else
                echo "script.sh: unknown option $1";
                exit;
            fi
        fi
    fi
}
#start script
cat /dev/null > checksums.txt;
cat /dev/null > dump.txt;
getArgs $1;
func $1;
parse;
#end script
2 Answers
It's pretty simple:
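A minimal sketch of the fix, reusing the question's variables md, file, m and f; the guard that skips comparing a file with itself is assumed here, mirroring the check already present in the full script's parse function:

# report a pair only when the names differ but the checksums match;
# note the single = and the quoted message
if [ "$file" != "$f" ] && [ "$md" = "$m" ]; then
    echo "Files $file and $f are duplicates"
fi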
Note that I changed the comparison operator from == to =, which is the common form. I also surrounded the message with double quotes to make it clear that it is a single string and that I don't want word expansion to happen on the two variables file and f.

[Update:]

Another way to find duplicates, which is much faster, is to use awk for the string processing:
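A sketch of that approach, assuming the whitespace-separated dump.txt layout shown in the question; the leading * marker shown before each file name is stripped before printing:

awk '{ sub(/^\*/, "", $2) }        # drop the * marker in front of the name
     $1 in seen { printf "Files %s and %s are duplicates.\n", seen[$1], $2 }
     { seen[$1] = $2 }' dump.txt

On the sample dump.txt above this prints a single line reporting ./test.txt and ./test2.txt as duplicates.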
If you decide to solve it with awk, you don't really need one loop, let alone two; awk is something of a nuclear warhead for text processing. A single awk command will bring what you need, and of course you can customize the message text. See the test below.

Updated:
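A sketch of such a command, again assuming the whitespace-separated dump.txt layout from the question; it collects the file names for each checksum and reports every group that contains more than one file:

awk '{ sub(/^\*/, "", $2)                # drop the * marker in front of the name
       names[$1] = names[$1] " " $2      # collect all names that share this checksum
       count[$1]++ }
     END { for (h in count)
               if (count[h] > 1)
                   print "duplicate files (md5 " h "):" names[h] }' dump.txt

With the question's dump.txt this reports one group: ./test.txt and ./test2.txt.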