Finding duplicates using md5sum

Posted 2024-12-09 06:00:16

I have a double loop that opens a file and uses awk to take the first section and the second section of each line. The first section is the md5sum of a file and the second is the filename. However, when I run the script to see whether I have duplicate files, file1 finds file1, so the script reports them as duplicates even though they are the same file. Here is my code:

echo start
for i in $(<dump.txt) ; do
    md=$(echo $i|awk -F'|' '{print $1}')
    file=$(echo $i|awk -F'|' '{print $2}')
    for j in $(<dump.txt) ; do
        m=$(echo $j|awk -F'|' '{print $1}')
        f=$(echo $j|awk -F'|' '{print $2}')
        if [ "$md" == "$m" ]; then
            echo $file and $f are duplicates
        fi
    done
done
echo end

The dump file looks like this:

404460c24654e3d64024851dd0562ff1 *./extest.sh
7a900fdfa67739adcb1b764e240be05f *./test.txt
7a900fdfa67739adcb1b764e240be05f *./test2.txt
88f5a6b83182ce5c34c4cf3b17f21af2 *./dump.txt
c8709e009da4cce3ee2675f2a1ae9d4f *./test3.txt
d41d8cd98f00b204e9800998ecf8427e *./checksums.txt

The entire code is:

#!/bin/sh
func ()  
{
if [ "$1" == "" ]; then
echo "Default";
for i in `find` ; 
do
    #if [ -d $i ]; then
        #echo $i "is a directory";
    #fi
    if [ -f $i ]; then
        if [ "$i" != "./ex.sh" ]; then
            #echo $i "is a file";
            md5sum $i >> checksums.txt;
            sort --output=dump.txt checksums.txt;
        fi
    fi
done
fi

if [ "$1" == "--long" ]; then
echo "--long";
for i in `find` ; 
do
    #if [ -d $i ]; then
        #echo $i "is a directory";
    #fi
    if [ -f $i ]; then
        echo $i "is a file";        
    fi
done
fi

if [ "$1" == "--rm" ]; then
echo "--rm";
for i in `find` ; 
do
    #if [ -d $i ]; then
        #echo $i "is a directory";
    #fi
    if [ -f $i ]; then
        echo $i "is a file";        
    fi
done
fi
}

parse () {
echo start
for i in $(<dump.txt) ; do
    md=$(echo $i|awk -F'|' '{print $1}')
    file=$(echo $i|awk -F'|' '{print $2}')
    for j in $(<dump.txt) ; do
        m=$(echo $j|awk -F'|' '{print $1}')
        f=$(echo $j|awk -F'|' '{print $2}')
        #echo $md
        #echo $m
        if [ "$file" != "$f" ] && [ "$md" == "$m" ]; then
            echo Files $file and $f are duplicates.
        fi
    done
done
echo end
}

getArgs () {
if [ "$1" == "--long" ]; then
    echo "got the first param $1";
else
    if [ "$1" == "--rm" ]; then
        echo "got the second param $1";
    else
        if [ "$1" == "" ]; then
            echo "got default param";
        else
            echo "script.sh: unknown option $1";
            exit;
        fi  
    fi
fi
}


#start script
cat /dev/null > checksums.txt;
cat /dev/null > dump.txt;
getArgs $1;
func $1;
parse;
#end script

颜漓半夏 2024-12-16 06:00:16

It's pretty simple:

if [ "$file" != "$f" ] && [ "$md" = "$m" ]; then
  echo "Files $file and $f are duplicates."
fi

Note that I changed the comparison operator from == to =, which is the standard form for test. I also put the message in double quotes to make it clear that it is a single string and that I don't want word splitting to happen on the two variables file and f.
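To make the quoting point concrete, here is a minimal sketch. The file name is a made-up example (it is not from the question), chosen only because it contains a run of spaces:

```shell
#!/bin/sh
# Hypothetical file name with three consecutive spaces, for illustration only.
file='my   report.txt'

# Unquoted: the shell splits the value into separate words and echo joins
# them again with single spaces, so the run of spaces collapses.
unquoted=$(echo Files $file)

# Quoted: the value reaches echo as one unchanged string.
quoted=$(echo "Files $file")

echo "$unquoted"
echo "$quoted"
```

The two lines printed differ only in the spacing inside the file name, which is exactly the kind of silent change that quoting prevents.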

[Update:]

Another way to find duplicates, which is much faster, is to use awk for string processing:

awk -F'|' '
  NF == 2 {
    if (fname[$1] != "") {
      print("Files " fname[$1] " and " $2 " are duplicates.");
    }
    fname[$1] = $2;
  }
' dump.txt
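The awk program above assumes fields separated by |, while the dump file shown in the question is in md5sum's own layout (checksum, a space, a * marker, then the name). As a hedged sketch, assuming that layout and 32-character MD5 checksums, the same idea could be adapted as follows. The sample lines are copied from the question; the variable names and the choice to remember only the first file per checksum are my own:

```shell
#!/bin/sh
# Sample lines in the "<checksum> *<name>" layout from the question.
sample='404460c24654e3d64024851dd0562ff1 *./extest.sh
7a900fdfa67739adcb1b764e240be05f *./test.txt
7a900fdfa67739adcb1b764e240be05f *./test2.txt
c8709e009da4cce3ee2675f2a1ae9d4f *./test3.txt'

result=$(printf '%s\n' "$sample" | awk '
  {
    hash = substr($0, 1, 32)   # an MD5 checksum is 32 hex characters
    name = substr($0, 35)      # skip the space and the mode marker
    if (hash in first) {
      print "Files " first[hash] " and " name " are duplicates."
    } else {
      first[hash] = name       # remember the first file with this checksum
    }
  }
')
echo "$result"
```

Cutting the line by position instead of by field keeps file names that contain spaces intact. To run it against a real file, replace the printf pipeline with dump.txt as the awk input.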

好久不见√ 2024-12-16 06:00:16

You don't really need one loop, let alone two, if you decide to solve this with awk. It is something like a nuclear warhead for text processing.

   awk -F'|' '{if($1 in a)print "duplicate found:" $0 " AND "a[$1];else a[$1]=$0 }' yourfile

gives you what you need. Of course, you can customize the message text.

See the test below:

kent$  cat md5chk.txt 
abcdefg|/foo/bar/a.txt
bbcdefg|/foo/bar2/ax.txt
cbcdefg|/foo/bar3/ay.txt
abcdefg|/foo/bar4/a.txt
1234567|/seven/7.txt
1234568|/seven/8.txt
1234567|/seven2/7.txt


kent$  awk -F'|' '{if($1 in a)print "duplicate found:" $0 " AND "a[$1];else a[$1]=$0 }' md5chk.txt
duplicate found:abcdefg|/foo/bar4/a.txt AND abcdefg|/foo/bar/a.txt
duplicate found:1234567|/seven2/7.txt AND 1234567|/seven/7.txt

Updated, with the command explained piece by piece:

awk     # the name of the tool/command
-F'|'   # declare delimiter is "|"
'{if($1 in a)  # if the first column was already saved
print "duplicate found:" $0 " AND "a[$1];  # print the info
else    # else
a[$1]=$0 }'  # save in an array named a, index=the 1st column (md5), value is the whole line.
yourfile  # your input file
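The same array idea can be stretched to list every file in a duplicate group on one line, rather than reporting pairs. This is only a sketch under the |-separated layout of the test file above; the array names count and names are my own, and the sort at the end is there because awk does not guarantee any order when iterating over an array:

```shell
#!/bin/sh
# A subset of the test data above, in "<md5>|<path>" form.
sample='abcdefg|/foo/bar/a.txt
bbcdefg|/foo/bar2/ax.txt
abcdefg|/foo/bar4/a.txt
1234567|/seven/7.txt
1234567|/seven2/7.txt'

groups=$(printf '%s\n' "$sample" | awk -F'|' '
  {
    count[$1]++                     # how many files share this checksum
    names[$1] = names[$1] " " $2    # append this file to its group
  }
  END {
    for (sum in count)
      if (count[sum] > 1)           # only checksums seen more than once
        print sum ":" names[sum]
  }
' | LC_ALL=C sort)
echo "$groups"
```

Each output line starts with the shared checksum, followed by all files that have it.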
