从多个文件中剪切特定列并使用 UNIX 工具重塑形状

发布于 2024-10-24 04:01:17 字数 4933 浏览 2 评论 0原文

我的一个文件夹中有数百个文件。其中每个文件都是制表符分隔的文本文件，包含超过 100 万行和 27 列。从每个文件中，我希望能够仅提取特定列（例如仅提取列：1,2,11,12,13）。第 3 栏：10 和第 10 栏14:27可以忽略。我希望能够对文件夹中的所有文件（例如 2300 个文件）执行此操作。每个 2300 个文件中的列看起来像这样......

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567890_A    rs758676    -     -      1              T                       T                 -     ....col27
1234567890_A    rs3916934   -     -      1              T                       T                 -     ....col27
1234567890_A    rs2711935   -     -      1              T                       C                 -     ....col27
1234567890_A    rs17126880  -     -      1              -                       -                 -     ....col27
1234567890_A    rs12831433  -     -      1              T                       T                 -     ....col27
1234567890_A    rs12797197  -     -      1              T                       C                 -     ....col27

第二个文件中的剪切列可能看起来像这样......

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567899_C    rs758676    -     -      100            T                       A                 -     ....col27
1234567899_C    rs3916934   -     -      100            T                       T                 -     ....col27
1234567899_C    rs2711935   -     -      100            T                       C                 -     ....col27
1234567899_C    rs17126880  -     -      100            C                       G                 -     ....col27
1234567899_C    rs12831433  -     -      100            T                       T                 -     ....col27
1234567899_C    rs12797197  -     -      100            T                       C                 -     ....col27

第三个文件中的剪切列可能看起来像这样...... 每个文件

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567999_F    rs758676    -     -      256            A                       A                 -     ....col27
1234567999_F    rs3916934   -     -      256            T                       T                 -     ....col27
1234567999_F    rs2711935   -     -      256            T                       C                 -     ....col27
1234567999_F    rs17126880  -     -      256            C                       G                 -     ....col27
1234567999_F    rs12831433  -     -      256            T                       T                 -     ....col27
1234567999_F    rs12797197  -     -      256            C                       C                 -     ....col27

中 Sample.ID 和 Sample.Index 的宽度相同，但在文件之间可以更改。 Sample.ID 的值在每个文件内相同，但在文件之间不同。每个剪切文件在“SNP.Name”列下具有相同的值。不同文件中的 Sample.Index 列有时可能相同。其他两列值 (Allele1...Forward & Allele2...Forward) 可能会更改，并在每个 SNP.Name 下粘贴“” sep 样本.ID。

我终于想将 2300 个文件中的所有剪切列合并（制表符分隔）为这种格式......

Sample.Index  Sample.ID     rs758676   rs3916934   rs2711935  rs17126880  rs12831433  rs12797197        
1             1234567890_A  T T         T T         T C         0 0         T T         T C        
200           1234567899_C  T A         T T         T C         C G         T T         T C        
256           1234567999_F  A A         T T         T C         C G         T T         C C

简单来说，我希望能够根据 Sample 将长格式转换为宽格式.ID 列。这类似于 R 中的 reshape 函数。我用 R 尝试过这个，它耗尽了内存并且速度非常慢。任何人都可以帮助使用unix工具吗？

当 reshape.sh 应用于 20 个文件时...它在输出中产生了一个虚假的“样本行”。此处显示前 4 个字段。

Sample.Index    Sample.ID   rs476542    rs7073746   
1234567891_A          11         C C          A G   
1234567892_A         191         T C          A G   
1234567893_A         204         T C          G G   
1234567894_A          15         T C          A G   
1234567895_A         158         T T          A A   
1234567896_A         208         T C          A A   
1234567897_A         111         T T          G G   
1234567898_A         137         T C          G G   
1234567899_A         216         T C          A G   
1234567900_A         113         T C          G G   
1234567901_A         152         T C          A G   
1234567902_A         178         C C          A A   
1234567903_A         135         C C          A A   
1234567904_A         125         T C          A A   
1234567905_A         194         C C          A A   
1234567906_A         110         C C          G G   
1234567907_A         126         C C          A A   
Sample         -                                    
1234567908_A         169         C C          G G   
1234567909_A         173         C C          G G   
1234567910_A         168         T C          A A

原文

I have several hundred files in a folder. Each of these file is a tab delimited text file that contain more than a million rows and 27 columns. From each file, I want to be able to extract only specific columns (say pull out only columns: 1,2,11,12,13). Columns 3:10 & 14:27 can be ignored. I want to be able to do this for all files in the folder (say 2300 files). The columns from each of the 2300 file looks like this..........

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567890_A    rs758676    -     -      1              T                       T                 -     ....col27
1234567890_A    rs3916934   -     -      1              T                       T                 -     ....col27
1234567890_A    rs2711935   -     -      1              T                       C                 -     ....col27
1234567890_A    rs17126880  -     -      1              -                       -                 -     ....col27
1234567890_A    rs12831433  -     -      1              T                       T                 -     ....col27
1234567890_A    rs12797197  -     -      1              T                       C                 -     ....col27

The cut columns from the 2nd file may look like this....

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567899_C    rs758676    -     -      100            T                       A                 -     ....col27
1234567899_C    rs3916934   -     -      100            T                       T                 -     ....col27
1234567899_C    rs2711935   -     -      100            T                       C                 -     ....col27
1234567899_C    rs17126880  -     -      100            C                       G                 -     ....col27
1234567899_C    rs12831433  -     -      100            T                       T                 -     ....col27
1234567899_C    rs12797197  -     -      100            T                       C                 -     ....col27

The cut columns from the 3rd file may look like this....

Sample.ID      SNP.Name    col3  col10  Sample.Index   Allele1...Forward       Allele2...Forward col14 ....col27      
1234567999_F    rs758676    -     -      256            A                       A                 -     ....col27
1234567999_F    rs3916934   -     -      256            T                       T                 -     ....col27
1234567999_F    rs2711935   -     -      256            T                       C                 -     ....col27
1234567999_F    rs17126880  -     -      256            C                       G                 -     ....col27
1234567999_F    rs12831433  -     -      256            T                       T                 -     ....col27
1234567999_F    rs12797197  -     -      256            C                       C                 -     ....col27

The width of the Sample.ID, Sample.Index are the same in each file but can change between files. The value of Sample.ID is the same within each file but different between files. Each of the cut files have the same values under "SNP.Name" column. The Sample.Index column may sometimes be same from different file. The other two columns values (Allele1...Forward & Allele2...Forward) may change, and are pasted with " " sep under each SNP.Name for each Sample.ID.

I finally want to merge (tab-delemited) all the cut columns from the 2300 files into this format ......

Sample.Index  Sample.ID     rs758676   rs3916934   rs2711935  rs17126880  rs12831433  rs12797197        
1             1234567890_A  T T         T T         T C         0 0         T T         T C        
200           1234567899_C  T A         T T         T C         C G         T T         T C        
256           1234567999_F  A A         T T         T C         C G         T T         C C

In simple terms I want to be able to convert a long format into wide format based on the Sample.ID column. This is similar to reshape function in R. I tried this with R and it runs out of memory and is really slow. Can anyone help with unix tools?

When reshape.sh was applied to 20 files... it produced a spurious "Samples line" in the output. The first 4 fields are featured here.

Sample.Index    Sample.ID   rs476542    rs7073746   
1234567891_A          11         C C          A G   
1234567892_A         191         T C          A G   
1234567893_A         204         T C          G G   
1234567894_A          15         T C          A G   
1234567895_A         158         T T          A A   
1234567896_A         208         T C          A A   
1234567897_A         111         T T          G G   
1234567898_A         137         T C          G G   
1234567899_A         216         T C          A G   
1234567900_A         113         T C          G G   
1234567901_A         152         T C          A G   
1234567902_A         178         C C          A A   
1234567903_A         135         C C          A A   
1234567904_A         125         T C          A A   
1234567905_A         194         C C          A A   
1234567906_A         110         C C          G G   
1234567907_A         126         C C          A A   
Sample         -                                    
1234567908_A         169         C C          G G   
1234567909_A         173         C C          G G   
1234567910_A         168         T C          A A

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梓梦 2024-10-31 04:01:17

#!/bin/bash

awk '

BEGIN {
  maxInd = length("Sample.Index")
  maxID  = length("Sample.ID")
}

FNR>11 && $2 ~ "^rs" {
  SNP[$2]
  key[$11,$1]
  val[$2,$11,$1]=$12" "$13
  maxInd = (len=length($11)) > maxInd ? len : maxInd
  maxID  = (len=length($1))  > maxID  ? len : maxID
}

END {
  printf("%-*s\t%*s\t", maxInd, "Sample.Index", maxID, "Sample.ID")
  for (rs in SNP)
    printf("%s\t", rs)
  printf("\n")

  for(pair in key) {
    split(pair,a,SUBSEP)
      printf("%-*s\t%*s\t", maxInd, a[1], maxID, a[2])
      for(rs in SNP) {
        ale = val[rs,a[1],a[2]]
        out = ale == "- -" ||  ale == "" ? "0 0" : ale
        printf("%*s\t", length(rs), out)
      }
      printf("\n")
  }
}' DNA*.txt

概念验证

$ ./reshapeDNA
Sample.Index       Sample.ID    rs2711935       rs10829026      rs3924674       rs2635442       rs715350        rs17126880      rs7037313       rs11983370      rs6424572       rs7055953       rs758676        rs7167305       rs12831433      rs2147587       rs12797197      rs3916934       rs11002902
11              1234567890_A          T T              0 0            C C             0 0            0 0               T C            0 0              C C            T G             0 0            C C              0 0              T C            A G              T T            T C              G G
111             1234567892_A          T T              T C            C C             0 0            0 0               C C            T C              C C            T T             0 0            C C              0 0              T T            A A              T T            T T              G G
1               1234567894_A          T T              0 0            T C             C C            A G               C C            0 0              C C            0 0             T C            C C              T T              T T            A G              T T            C C              G G
12              1234567893_A          T T              0 0            C C             T C            A A               T C            0 0              C C            0 0             T T            C C              T G              T C            A G              T T            T C              G G
15              1234567891_A          T T              C C            C C             0 0            0 0               C C            C C              C C            T T             0 0            C C              0 0              T C            A G              T T            T T              G G

#!/bin/bash

awk '

BEGIN {
  maxInd = length("Sample.Index")
  maxID  = length("Sample.ID")
}

FNR>11 && $2 ~ "^rs" {
  SNP[$2]
  key[$11,$1]
  val[$2,$11,$1]=$12" "$13
  maxInd = (len=length($11)) > maxInd ? len : maxInd
  maxID  = (len=length($1))  > maxID  ? len : maxID
}

END {
  printf("%-*s\t%*s\t", maxInd, "Sample.Index", maxID, "Sample.ID")
  for (rs in SNP)
    printf("%s\t", rs)
  printf("\n")

  for(pair in key) {
    split(pair,a,SUBSEP)
      printf("%-*s\t%*s\t", maxInd, a[1], maxID, a[2])
      for(rs in SNP) {
        ale = val[rs,a[1],a[2]]
        out = ale == "- -" ||  ale == "" ? "0 0" : ale
        printf("%*s\t", length(rs), out)
      }
      printf("\n")
  }
}' DNA*.txt

Proof of Concept

$ ./reshapeDNA
Sample.Index       Sample.ID    rs2711935       rs10829026      rs3924674       rs2635442       rs715350        rs17126880      rs7037313       rs11983370      rs6424572       rs7055953       rs758676        rs7167305       rs12831433      rs2147587       rs12797197      rs3916934       rs11002902
11              1234567890_A          T T              0 0            C C             0 0            0 0               T C            0 0              C C            T G             0 0            C C              0 0              T C            A G              T T            T C              G G
111             1234567892_A          T T              T C            C C             0 0            0 0               C C            T C              C C            T T             0 0            C C              0 0              T T            A A              T T            T T              G G
1               1234567894_A          T T              0 0            T C             C C            A G               C C            0 0              C C            0 0             T C            C C              T T              T T            A G              T T            C C              G G
12              1234567893_A          T T              0 0            C C             T C            A A               T C            0 0              C C            0 0             T T            C C              T G              T C            A G              T T            T C              G G
15              1234567891_A          T T              C C            C C             0 0            0 0               C C            C C              C C            T T             0 0            C C              0 0              T C            A G              T T            T T              G G

回复收藏 0 原文

~没有更多了~