How to replace some digits with letters in each line of a file (according to the letters in the 5th and 6th columns of that line)?
I have a space-delimited file that looks like this:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405
AX-75448119 Chr1_41908741 1 41908741 T C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 2 2 0 0 0 0 0 1 0 0 0 0 0
AX-75448118 Chr1_41908545 1 41908545 T A 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 2 2 2 0 1 1 1 2 -1 1 2 0 0 2 1 1 0 1 0 1 2 1 0 0 1 2 2 1 2 2 0 1 2 2 2 2 2 2 0 1 0 0 0 1 2 2 2 2 0
I want to replace the digits by letters according to the 5th and 6th columns:
- Replace `0` by `$5 $5` (the 5th column repeated twice), e.g. if the 5th column is `T`, replace `0` by `T T`
- Replace `2` by `$6 $6` (the 6th column repeated twice), e.g. if the 6th column is `C`, replace `2` by `C C`
- Replace `1` by `$5 $6`, e.g. if the 5th and 6th columns are `T` and `C`, respectively, replace `1` by `T C`
- Replace `-1` by `? ?`
Note that the 5th and 6th columns can each be T, A, C, or G.
So what I would like to have as output is:
AX-75448119 Chr1_41908741 1 41908741 T C T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T C T T T T T T T C T C T C T C T C T C T T T C T T T T T T T T C C C C T T T T T T T T T T T C T T T T T T T T T T
AX-75448118 Chr1_41908545 1 41908545 T A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T T T T T T T T T T T T T A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T A T A T T T A T A T T T T T T T T T T T T T A A A A A A A T T T A T A T A A A ? ? T A A A T T T T A A T A T A T T T A T T T A A A T A T T T T T A A A A A T A A A A A T T T A A A A A A A A A A A A A T T T A T T T T T T T A A A A A A A A A T T
I don't know whether this is possible with awk. If not, I will give it a try in Python, but I would rather use a Linux command like awk (which I expect to be much faster than Python, because my file has 1.2 million lines and my computer may start swapping with Python).
Note: Your spec is still incomplete; you never say what to do with `1` when the 5th and 6th columns are not `T` and `C`, respectively. There are a number of permutations that you are leaving out.
This might work for you:
This will also do replacements on the first 4 columns, and I didn't bother with the 1 or -1 case (left as an exercise for the reader), but you should be able to easily expand this to suit:
I really doubt that awk will be faster than perl at this.
awk is definitely your friend.
awk reads a data file line by line. You don't need or want any kind of loop structure to walk through the lines (unless you're getting very advanced).
A single action that prints `$0` is all you need to read each line of a file and print it out (the output goes to your screen, so don't try it on a big file).
Note that I used `$0` to indicate 'the whole line of data'. awk also has variables that refer to each field of data; you use values like `$2` to print the 2nd field in the file.
So for your problem, you want to test each line, test certain fields and set new values.
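As a small illustration of `$0` and `$2`, here is a minimal sketch. The three-line sample and the variable names `sample`, `whole` and `second` are made up for the example and are not part of the question's data.

```shell
# Hypothetical three-line sample, held in a variable so nothing touches the disk.
sample='id chr alleleA
AX-1 1 T
AX-2 1 A'

# $0 is the whole current line, so this echoes the input unchanged.
whole=$(printf '%s\n' "$sample" | awk '{ print $0 }')

# $2 is the second whitespace-separated field of each line.
second=$(printf '%s\n' "$sample" | awk '{ print $2 }')

printf '%s\n' "$whole"
printf '%s\n' "$second"
```

On a real file the same actions would be run as `awk '{ print $2 }' inputfile`, where `inputfile` stands for your own file name.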
To change all fields after a certain field, incorporate this code as needed,
}' inputfile ...
The `NR>1` means that only lines with a line number greater than 1 are processed. Note that we're using simple logic to implement your tests; it is easy to add more and more. Recall that it often makes sense to use 'layered' logic:
if ($5==0) { ... } else if ($5 == 1) { ....}
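The layered tests can be put together into one possible sketch of the whole transformation. This is not the code from this answer: it assumes the genotype columns start at field 7, passes the header line through unchanged, leaves any value other than 0, 1, 2 or -1 as it is, and builds the output line as a string instead of assigning to fields. It also uses a `for` loop over the fields of each line. The function name `recode` is hypothetical.

```shell
# recode: read the space-delimited table on stdin, write the recoded table to stdout.
recode() {
  awk '
    NR == 1 { print; next }            # assumption: line 1 is a header, pass it through
    {
      a = $5; b = $6                   # alleleA and alleleB for this line
      out = $1
      for (i = 2; i <= 6; i++)         # copy the six leading columns as they are
        out = out " " $i
      for (i = 7; i <= NF; i++) {      # recode every genotype column
        if      ($i == 0)  g = a " " a
        else if ($i == 1)  g = a " " b
        else if ($i == 2)  g = b " " b
        else if ($i == -1) g = "? ?"
        else               g = $i      # assumption: unknown codes are left unchanged
        out = out " " g
      }
      print out
    }
  '
}

# Tiny made-up example: one header line and one data line.
printf 'h1 h2 h3 h4 alleleA alleleB s1 s2 s3 s4\nAX-1 Chr1_5 1 5 T C 0 1 2 -1\n' | recode
```

On the real data this would be run as `recode < inputfile > outputfile`, with both file names being placeholders. Building the line in `out` avoids writing values that contain spaces back into fields.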
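One thing to watch is your requirement to output 'C C', for example. When you assign something like `$5="C C"` in awk, awk rebuilds `$0` with the new text but does not re-split it, so `$5` holds 'C C' as one field until the record is re-split (for instance by `$0 = $0`, or by the next tool in a pipeline); after that, field 5 is C, field 6 is C, and the value that was in `$6` before has shifted to field 7.
I have taken the shortcut of printing 'CC', and then using sed at the end to create the 'C C' values that your specification indicates.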
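A small sketch of what happens to the field numbering when a value containing a space is assigned to a field. The six-letter input line and the variable name `demo` are made up for the example.

```shell
demo=$(echo 'a b c d e f' | awk '{
  $5 = "C C"      # $0 is rebuilt as "a b c d C C f"
  print NF, $6    # not re-split yet: still 6 fields, $6 is still f
  $0 = $0         # force awk to split the rebuilt record again
  print NF, $6    # now 7 fields, and $6 is the second C
}')
printf '%s\n' "$demo"
# prints:
# 6 f
# 7 C
```

This is why any later step that counts fields (another awk, cut, and so on) sees the columns shifted after such an assignment.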
I'm not sure how to deal with
as it has to be one or the other, and I'm not sure what field you want to operate on. Use the above code as a guide. If you get stuck, post a new question with sample input data, expected output, current output and the code you are using.
I hope this helps.
Better to check the field's value by equality instead of by regexp:
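A minimal sketch of the difference, using three made-up values: an unanchored regexp such as `/1/` also matches `-1` and `10`, while a comparison with `==` only accepts the value 1. The variable name `cmp_out` is hypothetical.

```shell
# For each value print: the value, the regexp test result, the equality test result.
cmp_out=$(printf '%s\n' 1 -1 10 | awk '{ print $1, ($1 ~ /1/), ($1 == 1) }')
printf '%s\n' "$cmp_out"
# prints:
# 1 1 1
# -1 1 0
# 10 1 0
```

With genotype codes that include both `1` and `-1`, the regexp test would treat the two as the same unless it is anchored (`/^1$/`), so the equality test is the simpler choice.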