Uniq in awk; removing duplicate values in a column using awk
I have a large data file in the following format:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab-separated, and multiple values within a column are comma-separated. I would like to remove the duplicate values in the second column to get something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code, but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray)
  {
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray)
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.

The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves having to iterate over both arrays in a loop within a loop.

The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, so I added an if to keep it from printing a null value, which would result in ",WDR78," being printed if the if weren't there.

* In reality all arrays in AWK are associative.
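
The three points above can be put together roughly as follows. This is a minimal sketch rather than the answer's original script: the function name dedup_col2 and the variable names seen, values and out are illustrative, and it assumes three tab-separated columns as in the question. Because it walks the array with for (v in seen), the order of the surviving values is not guaranteed.

```shell
# Hypothetical helper: de-duplicate the comma-separated values in column 2.
# Reads the named files, or standard input when no file is given.
dedup_col2() {
  awk '
  BEGIN { FS = OFS = "\t" }
  {
    split("", seen)               # portable way to empty the array for each record
    n = split($2, values, ",")
    for (i = 1; i <= n; i++)
      if (values[i] != "")        # a trailing comma yields an empty last element
        seen[values[i]] = 1       # the value itself is the index, so "in" can find it
    out = ""
    for (v in seen)               # traversal order is unspecified in awk
      out = out (out == "" ? "" : ",") v
    $2 = out                      # assigning a field rebuilds the record with OFS
    print
  }' "$@"
}
```

It could be called as dedup_col2 knownGeneFromUCSC.txt. Unlike the question's script, the sketch joins the values without a trailing comma, which matches the desired output shown in the question.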
Perl:
awk:
Line 4 of the awk script preserves the original order of the values in the second field after the duplicates have been filtered out.
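
As a rough illustration of order-preserving de-duplication (not necessarily the script this answer describes), one option is to walk the split result by its numeric index and append each value the first time it is seen. The names dedup_col2_ordered, seen, values and out are hypothetical, and the sketch assumes the same three tab-separated columns as the question.

```shell
# Hypothetical helper: de-duplicate column 2 while keeping first-occurrence order.
# Reads the named files, or standard input when no file is given.
dedup_col2_ordered() {
  awk '
  BEGIN { FS = OFS = "\t" }
  {
    split("", seen)                     # reset the lookup table for each record
    n = split($2, values, ",")
    out = ""
    for (i = 1; i <= n; i++) {          # numeric loop keeps the original order
      v = values[i]
      if (v == "" || (v in seen))       # skip empty pieces and repeats
        continue
      seen[v] = 1
      out = out (out == "" ? "" : ",") v
    }
    $2 = out
    print
  }' "$@"
}
```

The numeric loop is the design choice that matters here: for (i in values) may visit the elements in any order, whereas counting from 1 to n follows the order in which split stored them.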
Sorry, I know you asked about awk... but Perl makes this much simpler:
Pure Bash 4.0 (one associative array):
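
A sketch of what a pure Bash approach with a single associative array might look like is given below. It is an assumption-laden illustration, not the answer's original code: the function name dedup_col2_bash and all variable names are hypothetical, it needs Bash 4.0 or later for local -A, and because tab is a whitespace IFS character, read collapses empty columns, so it only suits rows where all three columns are non-empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: de-duplicate column 2 of tab-separated input read from stdin.
# Requires Bash 4.0+ (associative arrays).
dedup_col2_bash() {
  local id col2 rest value out
  local -a values
  local -A seen
  while IFS=$'\t' read -r id col2 rest; do
    seen=()                                    # forget the values of the previous row
    out=""
    IFS=',' read -r -a values <<< "$col2"      # split column 2 on commas
    for value in "${values[@]}"; do
      # skip empty pieces and values already recorded for this row
      [[ -z $value || -n ${seen[$value]:-} ]] && continue
      seen[$value]=1
      out+="${out:+,}$value"                   # add a comma only between values
    done
    printf '%s\t%s\t%s\n' "$id" "$out" "$rest"
  done
}
```

It could be run as dedup_col2_bash < knownGeneFromUCSC.txt. For a large file the awk versions are likely to be considerably faster, since this loop runs entirely in the shell.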