Join columns based on the number of fields
I have a large workflow that gets tripped up by uncharacterized chromosomes. One process produces a count matrix that has n fields for canonical chromosomes, but lines for uncharacterized chromosomes have n + 1 or n + 2 fields. This is a headache when reading the matrix with read.table() downstream.
My approach is to first identify what n is, and use this to isolate the n + 1 and n + 2 lines containing these uncharacterized chromosomes:
awk -v nf="$canon" 'NF!=nf{print}{}' matrix.txt | head
chr22 KI270733v1 random 123189 123362 + 6 4 8 0 0 10
chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0
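The question does not show how `$canon` is obtained. A minimal sketch of one possibility follows, assuming canonical lines always have the fewest fields (which fits the n, n + 1, n + 2 description above). The function name `min_fields` is hypothetical.

```shell
# Hypothetical helper: print the smallest field count found in the input.
# Assumption: canonical lines have the fewest fields, so this value is n.
# Blank lines (NF == 0) are ignored.
min_fields() {
    awk 'NF && (min == 0 || NF < min) { min = NF } END { print min + 0 }' "$@"
}

# Usage (matrix.txt is the file named in the question):
#   canon=$(min_fields matrix.txt)
```

If some canonical lines could be truncated or malformed, taking the most frequent field count would be a safer choice than the minimum.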
The goal is for these lines to match the number of fields n, by joining the 1st and 2nd columns where NF is n + 1, and the 1st, 2nd and 3rd columns where NF is n + 2, to produce:
chrUn-GL000220v1 105951 106963 - 0 0 0 0 10 0
chr22-KI270733v1-random 123189 123362 + 6 4 8 0 0 10
Attempt
I could subset the matrix and split it into three files, one each for NF==n, NF==n+1 and NF==n+2, and join the columns:
awk -v n="$canon" 'NF==n{print}{}' matrix.txt | head
chr1 15534236 15536814 - 0 10 0 0 0 3
(^ no action needed)
awk -v n="$canon" 'NF==n+1{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2,$3,$4,$5,$6,$7,$8,$9,$10}' | head
chrUn-GL000220v1 105992 107309 - 0 0 0 0 0 4
and
awk -v n="$canon" 'NF==n+2{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2"-"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | head
chr22-KI270733v1-random 123189 123362 + 6 4 8 0 0 10
Unfortunately, this solution is not dynamic: I have to specify the range of columns. The workflow could contain any number of columns after the first four (Chr, Start, Stop, Strand).
Hopefully I have defined the problem well; any suggestions would be greatly appreciated.
Comments (1)
Try accumulating the surplus fields into $1 and clearing the rest with $i="". Once the surplus fields have been accumulated into $1, you could also move the remaining values to the left:
if (NF != n) for (i = 2; i <= n; ++i) $i=$(i+(NF-n))
and then set NF=n.
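The two ideas in the answer (accumulate the surplus fields into `$1`, then shift the remaining values left and set `NF=n`) can be combined into one pass that works for any number of columns. The sketch below is one way to do that, not the answerer's exact code. The function name `join_extra_fields` is hypothetical, and it assumes an awk that allows shrinking `NF` by assignment (gawk and mawk do).

```shell
# Hypothetical helper: fold surplus leading fields into field 1.
# Argument 1: n, the field count of a canonical line.
# Reads the matrix on stdin and writes tab-separated lines on stdout.
join_extra_fields() {
    awk -v n="$1" -v OFS='\t' '
        NF > n {
            extra = NF - n                    # number of surplus fields
            for (i = 2; i <= extra + 1; i++)  # accumulate fields 2..extra+1 into $1
                $1 = $1 "-" $i
            for (i = 2; i <= n; i++)          # shift the remaining values left
                $i = $(i + extra)
            NF = n                            # drop the leftover trailing fields
        }
        { $1 = $1; print }                    # rebuild every line with OFS
    '
}

# Usage (canon and matrix.txt as in the question; fixed.txt is a placeholder):
#   join_extra_fields "$canon" < matrix.txt > fixed.txt
```

Because the number of surplus fields is computed per line as `NF - n`, nothing depends on how many count columns follow Chr, Start, Stop and Strand. Canonical lines are also rewritten with tabs so that the whole output uses a single separator.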