How can I replace some numbers with letters in each line of a file (according to the letters in columns 5 and 6 of that line)?

Posted on 2025-01-03 07:23:59

I have a file which is space delimited which looks like this:

probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405 
AX-75448119 Chr1_41908741 1 41908741 T C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 2 2 0 0 0 0 0 1 0 0 0 0 0 
AX-75448118 Chr1_41908545 1 41908545 T A 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 2 2 2 0 1 1 1 2 -1 1 2 0 0 2 1 1 0 1 0 1 2 1 0 0 1 2 2 1 2 2 0 1 2 2 2 2 2 2 0 1 0 0 0 1 2 2 2 2 0

I want to replace the digits with letters according to the 5th and 6th columns:

  1. Replace 0 with $5 $5 (the 5th column repeated twice), e.g. if the 5th column is T, replace 0 with T T
  2. Replace 2 with $6 $6 (the 6th column repeated twice), e.g. if the 6th column is C, replace 2 with C C
  3. Replace 1 with $5 $6, e.g. if the 5th and 6th columns are T and C, respectively, replace 1 with T C
  4. Replace -1 with ? ?

Note that the 5th and 6th columns can each be T, A, C, or G.

So the output I would like is:

AX-75448119 Chr1_41908741 1 41908741 T C T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T C T T T T T T T C T C T C T C T C T C T T T C T T T T T T T T C C C C T T T T T T T T T T T C T T T T T T T T T T 
AX-75448118 Chr1_41908545 1 41908545 T A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T T T T T T T T T T T T T A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T A T A T T T A T A T T T T T T T T T T T T T A A A A A A A T T T A T A T A A A ? ? T A A A T T T T A A T A T A T T T A T T T A A A T A T T T T T A A A A A T A A A A A T T T A A A A A A A A A A A A A T T T A T T T T T T T A A A A A A A A A T T

I don't know whether this is possible with awk. If not, I will give it a try in Python, but I would rather use a Linux command like awk, which I expect to be much faster than Python: my file has 1.2 million lines, and my computer may start swapping if I process it with Python.


Comments (5)

薆情海 2025-01-10 07:23:59
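NR>1{
  # defaults leave a code untouched when no rule below covers the allele pair
  o="1"; z="0"; t="2"
  if($5 == "T" && $6 == "C")
    o="T C"
  if($5 == "T")
    z="T T"
  if($6 == "C")
    t="C C"
  if($6 == "A")
    t="A A"
  for (i=7; i<=NF; i++) {
    # handle -1 first so that the 1 inside it is not rewritten by the next gsub
    gsub(/-1/,"? ?", $i)
    gsub(/1/,o,$i)
    gsub(/0/,z,$i)
    gsub(/2/,t,$i)
  }
}1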

Output

$ awk -f allele.awk allele.in
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405
AX-75448119 Chr1_41908741 1 41908741 T C T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T C T T T T T T T C T C T C T C T C T C T T T C T T T T T T T T C C C C T T T T T T T T T T T C T T T T T T T T T T
AX-75448118 Chr1_41908545 1 41908545 T A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T T T T T T T T T T T T 1 A A A A A A A A A A A A A A A A A A T T T T T T T T T T 1 1 T T 1 1 T T T T T T T T T T T T 1 A A A A A A T T 1 1 1 A A ? ? 1 A A T T T T A A 1 1 T T 1 T T 1 A A 1 T T T T 1 A A A A 1 A A A A T T 1 A A A A A A A A A A A A T T 1 T T T T T T 1 A A A A A A A A T T

Note: Your spec is still incomplete, you never say what to do with 1 when the 5th and 6th columns are not T C, respectively. There are a number of permutations that you are leaving out.
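
For what it's worth, the four rules in the question can be written without listing any allele pairs, since they only depend on whatever letters are in $5 and $6. The following is a rough sketch, not a tested solution for the real file: recode_genotypes is a made-up helper name, and it assumes that the header line should be passed through unchanged and that every genotype field is exactly 0, 1, 2 or -1.

```shell
# Hypothetical helper: recode genotype codes using the alleles in columns 5 and 6.
recode_genotypes() {
  awk '
    NR == 1 { print; next }            # assumption: keep the header line as it is
    {
      hom_a = $5 " " $5                # code 0: column 5 twice
      het   = $5 " " $6                # code 1: column 5, then column 6
      hom_b = $6 " " $6                # code 2: column 6 twice
      for (i = 7; i <= NF; i++) {
        # whole-field string comparisons, so "-1" is never confused with "1"
        if      ($i == "-1") $i = "? ?"
        else if ($i == "0")  $i = hom_a
        else if ($i == "1")  $i = het
        else if ($i == "2")  $i = hom_b
      }
      print
    }
  ' "$@"
}
```

It would be called as something like recode_genotypes allele.in > allele.out. awk works through the input one line at a time, so memory use should not grow with the number of lines.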

じ违心 2025-01-10 07:23:59

This might work for you (GNU awk; \< and \> are gawk word-boundary operators):

awk 'NR>1{a=$3;$3="@";gsub(/ -1\>/," ? ?");gsub(/\<0\>/,$5 " " $5);gsub(/\<1\>/,$5 " " $6);gsub(/\<2\>/,$6 " " $6);$3=a;print}' file
沉鱼一梦 2025-01-10 07:23:59

This will also do replacements on the first 4 columns, and I didn't bother with the 1 or -1 case (left as an exercise for the reader), but you should be able to easily expand this to suit:

$ perl -lape 's/0/$F[ 4 ] $F[ 4 ]/g; s/2/$F[ 5 ] $F[ 5 ]/g' input

I really doubt that awk will be faster than perl at this.

旧情勿念 2025-01-10 07:23:59

awk is definitely your friend.

awk reads a datafile, line-by-line. You don't need/want to have any kind of loop structure (unless you're getting very advanced).

awk '{print $0}' inFile

is all you need to read each line of a file and print it out (the output goes to your screen, so don't try it on a big file).

Note that I used $0 to refer to the whole line of data.

awk also has variables that refer to each field of data; for example, $2 prints the 2nd field of each line.

I want to replace the digits by letters according to column 5th & 6th. So what I want is to replace 0 with T T (if the 5th column is T) and 2 by C C (if the 6th column is C) and 1 with T C (if the 5th and 6th columns are T and C respectively) and I want to change -1 to ? ? or ! !

So for your problem, you want to test each line, test certain fields and set new values.

awk 'NR>1{
  # replace 0 with T T (if the 5th column is T)
  if ($5 == 0) $5="TT"
  # and 2 by C C (if the 6th column is C)
  if ($6 == 2) $6="CC"
  # and 1 with T C (if the 5th and 6th columns are T and C respectively)
  if ($5 == "T" && $6 == "C") $1="1"
}'  inputFile  | sed 's/TT/T T/; s/CC/C C/'

To change all fields after a certain field, incorporate this code as needed,

awk 'NR>1{
  # replace 0 with T T (if the 5th column is T)
  if ($5 == 0) { 
     for (i=5; i<=NF;i++) {
         printf("T ")
     }
     printf("\n")
 }
 ......

}' inputfile ...

The NR>1 means that only lines with a line number greater than 1 are processed.

Note that we're using simple logic to implement your tests; it is easy to add more and more. Recall that it often makes sense to use 'layered' logic: if ($5==0) { ... } else if ($5 == 1) { ....}

The one problem is your requirement to output 'C C', for example. When you assign something like $5="C C" in awk, the field itself now contains a space. awk does not renumber anything immediately, but as soon as the line is re-split (for example by $0=$0, or by a later command in the pipeline), that field counts as two, so $5 will be C, $6 will be C, and every later field shifts by one.

I have taken the short-cut of printing 'CC', and then using sed at the end to create the 'C C' values that your specification indicates.
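
If you want to see for yourself when the field numbers actually move, a tiny experiment along these lines may help. It is only an illustration: show_resplit is a made-up name and the three-letter input is a toy record, not your data.

```shell
# Hypothetical demo: a field value containing a space only shifts the
# numbering once the record is split again.
show_resplit() {
  printf 'a b c\n' | awk '{
    $2 = "x y"        # the field now holds a space; $0 is rebuilt as "a x y c"
    print NF, $3      # fields are not re-split yet
    $0 = $0           # force awk to split the rebuilt record again
    print NF, $3      # the numbering has shifted by one
  }'
}
show_resplit
```

The first line printed is 3 c and the second is 4 y, so the shift only appears after the re-split.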

I'm not sure how to deal with

and I want to change -1 to ? ? or ! !

as it has to be one or the other, and I'm not sure what field you want to operate on. Use the above code as a guide. If you get stuck, post a new question with sample input data, expected output, current output and the code you are using.

I hope this helps.

寂寞陪衬 2025-01-10 07:23:59

Better to check the field's value by equality instead of by regexp:

awk '
    NR==1 {print; next}
    {check0 = check1 = check2 = 0}
    $5 == "T"              {check0 = 1}
    $5 == "T" && $6 == "C" {check1 = 1}
    $6 == "C" || $6 == "A" {check2 = 1}
    {
        for (idx=7; idx <= NF; idx++)
            if      (check0 && $idx == 0) $idx = "T T"
            else if (check1 && $idx == 1) $idx = "T C"
            else if (check2 && $idx == 2) $idx = $6 " " $6
            else if ($idx == -1)          $idx = "? ?"
        print
    }
'
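
To make the difference concrete, here is a small comparison on a single -1 field. This is just an illustration with made-up function names (by_regexp, by_equality), not part of the script above.

```shell
# Hypothetical demo: regexp substitution vs. whole-field equality.
by_regexp() {
  # rewrites every "1" it finds, including the one inside "-1"
  awk '{ gsub(/1/, "T C", $1); print }'
}
by_equality() {
  # only rewrites the field when it is exactly "1"
  awk '{ if ($1 == "1") $1 = "T C"; print }'
}
printf '%s\n' -1 | by_regexp      # prints: -T C
printf '%s\n' -1 | by_equality    # prints: -1
```

The regexp version mangles the missing-genotype code, while the equality version leaves it alone for the -1 rule to handle.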