Uniq in awk; removing duplicate values in a column with awk

Posted on 2024-09-04 03:46:08

I have a large datafile in the following format below:

ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:

ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code below but it doesn't seem to remove the duplicate values.

awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?

4 Answers

最丧也最甜 2024-09-11 03:46:08

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.

The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves you from having to iterate over both arrays in a nested loop.

The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, so I added an if to keep it from printing a null value; without that check, ",WDR78," would end up in the output.

* In reality all arrays in AWK are associative.

awk '
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray)
  { 
    if (!(valueArray[i] in duplicateArray))
    { 
      duplicateArray[valueArray[i]] = 1
    }
  };
  printf $1 "\t";
  for (j in duplicateArray)
  {
    if (j)    # prevents printing an extra comma
    {
      printf j ",";
    }
  }
  printf "\t";
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'
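
For a quick illustration of the point above that the in operator tests array indices rather than stored values, here is a minimal, self-contained sketch (the sample value is borrowed from the question's data; run it with no input file, since everything happens in BEGIN):

awk 'BEGIN {
  a[1] = "WDR78"                   # 1 is the index, "WDR78" is the value
  if ("WDR78" in a) print "index"; else print "not an index"   # prints "not an index"
  if (1 in a) print "1 is an index"                             # prints "1 is an index"
  seen["WDR78"] = 1                # store the value itself as the index
  if ("WDR78" in seen) print "now it is an index"               # prints "now it is an index"
}'
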
情归归情 2024-09-11 03:46:08

Perl:

perl -F'\t' -lane'
  $F[1] = join ",", grep !$_{$_}++, split ",", $F[1]; 
  print join "\t", @F; %_ = ();
  ' infile  

awk:

awk -F'\t' '{
  n = split($2, t, ","); _2 = x
  split(x, _) # use delete _ if supported
  for (i = 0; ++i <= n;)
    _[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
  $2 = _2 
  }-3' OFS='\t' infile

Line 4 of the awk script is what preserves the original order of the values in the second field while filtering out the duplicates.
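
The trailing -3 after the closing brace is just a nonzero constant used as an always-true pattern, so awk falls back to its default action and prints the (now modified) record. For readers who prefer something less compact, here is a more verbose sketch of the same order-preserving idea, with made-up variable names and an explicit loop in place of line 4:

awk -F'\t' '{
  n = split($2, parts, ",")
  out = ""
  delete seen                        # reset per line (gawk and most modern awks;
                                     # portable alternative: split("", seen))
  for (i = 1; i <= n; i++) {
    if (parts[i] == "") continue     # skip the empty field left by a trailing comma
    if (!seen[parts[i]]++)           # first time this value is seen on this line?
      out = (out == "") ? parts[i] : out "," parts[i]
  }
  $2 = out                           # reassigning $2 rebuilds $0 with OFS
  print
}' OFS='\t' knownGeneFromUCSC.txt

Note that this sketch also drops the empty element produced by a trailing comma, so "WDR78,WDR78,WDR78," becomes plain "WDR78".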

风月客 2024-09-11 03:46:08

Sorry, I know you asked about awk... but Perl makes this much simpler:

$ perl -n -e ' @t = split(/\t/);
  %t2 = map { $_ => 1 } split(/,/,$t[1]);
  $t[1] = join(",",keys %t2);
  print join("\t",@t); ' knownGeneFromUCSC.txt
流殇 2024-09-11 03:46:08

Pure Bash 4.0 (one associative array):

declare -a part                            # parts of a line
declare -a part2                           # parts 2. column
declare -A check                           # used to remember items in part2

while read  line ; do
  part=( $line )                           # split line using whitespaces
  IFS=','                                  # separator is comma
  part2=( ${part[1]} )                     # split 2. column using comma
  if [ ${#part2[@]} -gt 1 ] ; then         # more than 1 field in 2. column?
    check=()                               # empty check array
    new2=''                                # empty new 2. column
    for item in ${part2[@]} ; do 
      (( check[$item]++ ))                 # remember items in 2. column
      if [ ${check[$item]} -eq 1 ] ; then  # not yet seen?
        new2=$new2,$item                   # add to new 2. column
      fi 
    done
    part[1]=${new2#,}                      # remove leading comma
  fi 
  IFS=$'\t'                                # separator for the output
  echo "${part[*]}"                        # rebuild line
done < "$infile"