Piping grep and awk to save specific fields of a file

Posted 2025-01-24 13:05:12

What I want to achieve:

  • grep: extract lines with the contig number and length
  • awk: remove "length:" from column 2
  • sort: sort by length (in descending order)

Current code

grep "length:" test_reads.fa.contigs.vcake_output | awk -F:'{print $2}' |sort -g -r > contig.txt

Example content of test_reads.fa.contigs.vcake_output:

>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA

Expected output

>Contig_0 99995
>Contig_11 42


Comments (4)

烟酉 2025-01-31 13:05:12

With your shown samples, please try the following awk + sort solution.

awk -F'[: ]' '/^>/{print $1,$3}' Input_file | sort -nrk2

Explanation: awk reads Input_file with the field separator set to either a colon or a space. For each line that starts with >, it prints the 1st and 3rd fields (the contig name and the length). That output is piped to sort, which orders it numerically on the 2nd field in descending order to give the required output.
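To make the explanation concrete, here is a minimal sketch that runs the same pipeline against the two-record sample from the question. The temporary file and the `input` and `result` variable names are only for illustration and are not part of the original answer.

```shell
# Sample data copied from the question; the temp file is illustrative only.
input=$(mktemp)
cat > "$input" <<'EOF'
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
EOF

# Split each line on ':' or ' ', keep only header lines (starting with '>'),
# print the contig name ($1) and the length ($3), then sort on field 2,
# numerically, in reverse (descending) order.
result=$(awk -F'[: ]' '/^>/{print $1,$3}' "$input" | sort -nrk2)
printf '%s\n' "$result"

rm -f "$input"
```

Note that the command in the question writes `-F:'{print $2}'` with no space before the quoted program, so the shell passes awk a single argument and awk never receives a program to run.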

命硬 2025-01-31 13:05:12

Here is a gnu-awk solution that does it all in a single command without invoking sort:

awk -F '[:[:blank:]]' '
$2 == "length" {arr[$1] = $3}
END {
   PROCINFO["sorted_in"] = "@val_num_desc"  # traverse by value (the length), numerically descending
   for (i in arr)
      print i, arr[i]
}' file

>Contig_0 99995
>Contig_11 42
拥醉 2025-01-31 13:05:12

Perhaps this, combining grep and awk:

awk -F '[ :]' '$2 == "length" {print $1, $3}' file | sort ...
命硬 2025-01-31 13:05:12

Assumptions:

  • if more than one row has the same length then additionally sort the 1st column using 'version' sort

Adding some additional lines to the sample input:

$ cat test_reads.fa.contigs.vcake_output
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT

One sed/sort idea:

$ sed -rn 's/(>[^ ]+) length:(.*)$/\1 \2/p' test_reads.fa.contigs.vcake_output | sort -k2,2nr -k1,1V

Where:

  • -rn - enable extended regex support (-r; GNU sed also accepts -E) and suppress normal printing of input data (-n)
  • (>[^ ]+) - (1st capture group) - > followed by 1 or more non-space characters
  • " length:" - a space followed by length:
  • (.*) - (2nd capture group) - 0 or more characters (following the colon)
  • $ - end of line
  • \1 \2/p - print 1st capture group + <space> + 2nd capture group
  • -k2,2nr - sort by 2nd (space-delimited) field in reverse numeric order
  • -k1,1V - sort by 1st (space-delimited) field in Version order

This generates:

>Contig_0 99995
>Contig_17 93
>Contig_8 42
>Contig_11 42
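
As a rough, self-contained check of the approach above, the following sketch recreates the extended sample in a temporary file and runs the same substitution and sort. It uses `sed -E` rather than `-r` as a portability assumption, and it relies on `sort -V`, which is available in GNU sort but may be missing elsewhere. The `input` and `result` names are hypothetical.

```shell
# Extended sample from this answer, written to an illustrative temp file.
input=$(mktemp)
cat > "$input" <<'EOF'
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
EOF

# -n plus the trailing /p prints only lines where the substitution matched,
# so sequence lines and the "ignore-this-length" header are dropped.
# Ties on length (field 2) are broken by version-sorting the name (field 1).
result=$(sed -En 's/(>[^ ]+) length:(.*)$/\1 \2/p' "$input" | sort -k2,2nr -k1,1V)
printf '%s\n' "$result"

rm -f "$input"
```

The `Contig_837` header is skipped because the pattern requires a space immediately before `length:`, and version ordering places `Contig_8` ahead of `Contig_11` among the two records of length 42.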