Why does the comparison of two md5sum files not work correctly?

Posted on 2025-01-11 03:10:32

I have 2 lists of files with their md5sum checksums, and the lists have different paths for the same files.

Example of content in the first file with checksums (server.list):

2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R2_001.fastq.gz/
6e6bcd84f264233cf7c428c0cfdc0c03  tmp/fastq1_L002_R1_001.fastq.gz

Example of content in the second file with checksums (downloaded.list):

2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03  /home/projects/fastq1_L002_R1_001.fastq.gz

When I run the following command, I get this output:

awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list

fastq1_L001_R1_001.fastq.gz has a different md5sum
fastq1_L001_R2_001.fastq.gz has a different md5sum
fastq1_L002_R2_001.fastq.gz has a different md5sum

Why am I getting this message when the first column is the same in both files? Can someone enlighten me on this issue?

Edit:

If I remove the path and leave only the file name, it works just fine.

Edit 2:

As pointed out, a file path can also take another form, one that does not start with /. In that case, I cannot use / as the field separator.

云淡月浅 2025-01-18 03:10:32

Assumptions:

filename (sans path) and md5sum have to match
filenames may not be listed in the same order
filenames may not exist in both files

Sample data:

$ head downloaded.list server.list
==> downloaded.list <==
2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz   # match
YYYYf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R5_911.fastq.gz   # different md5sum
c430f587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R2_001.fastq.gz   # match
MNOPf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R8_abc.fastq.gz   # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R9_004.fastq.gz   # different filename but matching md5sum (vs last line of other file)

==> server.list <==
2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz             # match
c430f587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R2_001.fastq.gz             # match
XXXXf587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R5_911.fastq.gz             # different md5sum
TUVWff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L999_R6_922.fastq.gz             # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R7_933.fastq.gz             # different filename but matching md5sum (vs last line of other file)

One awk idea that addresses the whitespace issue and also verifies that the filenames match:

awk '                                    # keep the default whitespace field delimiter, but ...
{ md5sum=$1                              # 1st field: checksum
  n=split($2,arr,"/")                    # split the 2nd field (the path) on "/"
  fname=arr[n]                           # last piece is the file name

  if (FNR==NR)                           # 1st file: remember the checksum per file name
     filearray[fname]=md5sum
  else {                                 # 2nd file: report missing names and mismatches
     if (fname in filearray && filearray[fname] == md5sum)
        next
     printf "%s has a different md5sum\n",fname
  }
}
' downloaded.list server.list

This generates:

fastq1_L001_R5_911.fastq.gz has a different md5sum
fastq1_L999_R6_922.fastq.gz has a different md5sum
fastq1_L001_R7_933.fastq.gz has a different md5sum
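
Note that the script above prints the same message for a file whose checksum differs and for a file that is absent from the first list. If that distinction matters, a variant along the following lines might help. This is only a sketch: the function name compare_md5_lists and the second message text are made up for illustration, stripping a trailing "/" from the path is an assumption about what such an entry means, and the demo uses a shortened copy of the sample data.

```shell
#!/bin/sh
# compare_md5_lists is a hypothetical helper name, not part of the original answer.
# Usage: compare_md5_lists REFERENCE_LIST LIST_TO_CHECK
compare_md5_lists() {
    awk '
    NF < 2 { next }                         # skip blank or malformed lines
    {
        md5sum = $1 ""                      # appending "" forces string comparison
        path = $2
        sub(/\/+$/, "", path)               # assumption: a trailing "/" is a typo, drop it
        n = split(path, parts, "/")
        fname = parts[n]                    # file name without its directory
    }
    FNR == NR             { seen[fname] = md5sum; next }
    !(fname in seen)      { printf "%s is missing from the first list\n", fname; next }
    seen[fname] != md5sum { printf "%s has a different md5sum\n", fname }
    ' "$1" "$2"
}

# Demo on throwaway files holding a few of the sample lines.
tmpdir=$(mktemp -d)
cat > "$tmpdir/downloaded.list" <<'EOF'
2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz
YYYYf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R5_911.fastq.gz
EOF
cat > "$tmpdir/server.list" <<'EOF'
2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz
XXXXf587aba1aa9f4fdf69aeb4526621  tmp/fastq1_L001_R5_911.fastq.gz
TUVWff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L999_R6_922.fastq.gz
EOF
result=$(compare_md5_lists "$tmpdir/downloaded.list" "$tmpdir/server.list")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

On this demo data it reports fastq1_L001_R5_911.fastq.gz as having a different md5sum and fastq1_L999_R6_922.fastq.gz as missing from the first list. Because the path is split on "/" only after the line has been split on whitespace, a relative path such as tmp/... is handled the same way as an absolute one.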

旧时浪漫 2025-01-18 03:10:32

The whitespace in $1, which is used as the array key, is causing the problem. Remove it:

awk -F"/" '{gsub(/ /, "", $1)}; FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' list1.txt list2.txt
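
With -F"/", a relative path such as tmp/fastq1_L002_R1_001.fastq.gz leaves the directory name inside $1, so removing spaces alone would still give a key like 6e6b...tmp. One possible way around this, while keeping "/" as the separator, is to cut $1 at the first space or tab. The sketch below assumes the checksum is the first thing on each line; compare_by_checksum is a made-up name, and the ffff... checksum and L003 file name in the demo are invented test data.

```shell
#!/bin/sh
# compare_by_checksum is a hypothetical helper name used only for this sketch.
# Usage: compare_by_checksum LIST1 LIST2
compare_by_checksum() {
    awk -F"/" '
    {
        key = $1
        sub(/[ \t].*$/, "", key)            # keep only the text before the first space or tab
    }
    FNR == NR           { filearray[key] = $NF; next }
    !(key in filearray) { printf "%s has a different md5sum\n", $NF }
    ' "$1" "$2"
}

# Demo: list2 mixes a single space, a relative path, and an unknown checksum.
tmpdir=$(mktemp -d)
cat > "$tmpdir/list1.txt" <<'EOF'
2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03  /home/projects/fastq1_L002_R1_001.fastq.gz
EOF
cat > "$tmpdir/list2.txt" <<'EOF'
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03  tmp/fastq1_L002_R1_001.fastq.gz
ffffffffffffffffffffffffffffffff  /tmp/fastq1_L003_R1_001.fastq.gz
EOF
result=$(compare_by_checksum "$tmpdir/list1.txt" "$tmpdir/list2.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the invented L003 entry is reported here. Like the one-liner above, this checks whether each checksum from the second list appears anywhere in the first list; it does not check that the checksum belongs to the same file name.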
