从多个文件中剪切特定列并使用 UNIX 工具重塑形状
我的一个文件夹中有数百个文件。其中每个文件都是制表符分隔的文本文件,包含超过 100 万行和 27 列。从每个文件中,我希望能够仅提取特定列(例如仅提取列:1,2,11,12,13)。第 3 栏:10 和第 10 栏14:27可以忽略。我希望能够对文件夹中的所有文件(例如 2300 个文件)执行此操作。每个 2300 个文件中的列看起来像这样......
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567890_A rs758676 - - 1 T T - ....col27
1234567890_A rs3916934 - - 1 T T - ....col27
1234567890_A rs2711935 - - 1 T C - ....col27
1234567890_A rs17126880 - - 1 - - - ....col27
1234567890_A rs12831433 - - 1 T T - ....col27
1234567890_A rs12797197 - - 1 T C - ....col27
第二个文件中的剪切列可能看起来像这样......
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567899_C rs758676 - - 100 T A - ....col27
1234567899_C rs3916934 - - 100 T T - ....col27
1234567899_C rs2711935 - - 100 T C - ....col27
1234567899_C rs17126880 - - 100 C G - ....col27
1234567899_C rs12831433 - - 100 T T - ....col27
1234567899_C rs12797197 - - 100 T C - ....col27
第三个文件中的剪切列可能看起来像这样...... 每个文件
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567999_F rs758676 - - 256 A A - ....col27
1234567999_F rs3916934 - - 256 T T - ....col27
1234567999_F rs2711935 - - 256 T C - ....col27
1234567999_F rs17126880 - - 256 C G - ....col27
1234567999_F rs12831433 - - 256 T T - ....col27
1234567999_F rs12797197 - - 256 C C - ....col27
中 Sample.ID
和 Sample.Index
的宽度相同,但在文件之间可以更改。 Sample.ID
的值在每个文件内相同,但在文件之间不同。每个剪切文件在“SNP.Name”列下具有相同的值。不同文件中的 Sample.Index
列有时可能相同。其他两列值 (Allele1...Forward & Allele2...Forward)
可能会更改,并在每个 SNP.Name
下粘贴“” sep 样本.ID
。
我终于想将 2300 个文件中的所有剪切列合并(制表符分隔)为这种格式......
Sample.Index Sample.ID rs758676 rs3916934 rs2711935 rs17126880 rs12831433 rs12797197
1 1234567890_A T T T T T C 0 0 T T T C
200 1234567899_C T A T T T C C G T T T C
256 1234567999_F A A T T T C C G T T C C
简单来说,我希望能够根据 Sample 将长格式转换为宽格式.ID
列。这类似于 R 中的 reshape
函数。我用 R 尝试过这个,它耗尽了内存并且速度非常慢。任何人都可以帮助使用unix工具吗?
当 reshape.sh 应用于 20 个文件时...它在输出中产生了一个虚假的“样本行”。此处显示前 4 个字段。
Sample.Index Sample.ID rs476542 rs7073746
1234567891_A 11 C C A G
1234567892_A 191 T C A G
1234567893_A 204 T C G G
1234567894_A 15 T C A G
1234567895_A 158 T T A A
1234567896_A 208 T C A A
1234567897_A 111 T T G G
1234567898_A 137 T C G G
1234567899_A 216 T C A G
1234567900_A 113 T C G G
1234567901_A 152 T C A G
1234567902_A 178 C C A A
1234567903_A 135 C C A A
1234567904_A 125 T C A A
1234567905_A 194 C C A A
1234567906_A 110 C C G G
1234567907_A 126 C C A A
Sample -
1234567908_A 169 C C G G
1234567909_A 173 C C G G
1234567910_A 168 T C A A
I have several hundred files in a folder. Each of these file is a tab delimited text file that contain more than a million rows and 27 columns. From each file, I want to be able to extract only specific columns (say pull out only columns: 1,2,11,12,13). Columns 3:10 & 14:27 can be ignored. I want to be able to do this for all files in the folder (say 2300 files). The columns from each of the 2300 file looks like this..........
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567890_A rs758676 - - 1 T T - ....col27
1234567890_A rs3916934 - - 1 T T - ....col27
1234567890_A rs2711935 - - 1 T C - ....col27
1234567890_A rs17126880 - - 1 - - - ....col27
1234567890_A rs12831433 - - 1 T T - ....col27
1234567890_A rs12797197 - - 1 T C - ....col27
The cut columns from the 2nd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567899_C rs758676 - - 100 T A - ....col27
1234567899_C rs3916934 - - 100 T T - ....col27
1234567899_C rs2711935 - - 100 T C - ....col27
1234567899_C rs17126880 - - 100 C G - ....col27
1234567899_C rs12831433 - - 100 T T - ....col27
1234567899_C rs12797197 - - 100 T C - ....col27
The cut columns from the 3rd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567999_F rs758676 - - 256 A A - ....col27
1234567999_F rs3916934 - - 256 T T - ....col27
1234567999_F rs2711935 - - 256 T C - ....col27
1234567999_F rs17126880 - - 256 C G - ....col27
1234567999_F rs12831433 - - 256 T T - ....col27
1234567999_F rs12797197 - - 256 C C - ....col27
The width of the Sample.ID
, Sample.Index
are the same in each file but can change between files. The value of Sample.ID
is the same within each file but different between files. Each of the cut files have the same values under "SNP.Name" column. The Sample.Index
column may sometimes be same from different file. The other two columns values (Allele1...Forward & Allele2...Forward)
may change, and are pasted with " " sep under each SNP.Name
for each Sample.ID
.
I finally want to merge (tab-delemited) all the cut columns from the 2300 files into this format ......
Sample.Index Sample.ID rs758676 rs3916934 rs2711935 rs17126880 rs12831433 rs12797197
1 1234567890_A T T T T T C 0 0 T T T C
200 1234567899_C T A T T T C C G T T T C
256 1234567999_F A A T T T C C G T T C C
In simple terms I want to be able to convert a long format into wide format based on the Sample.ID
column. This is similar to reshape
function in R. I tried this with R and it runs out of memory and is really slow. Can anyone help with unix tools?
When reshape.sh was applied to 20 files... it produced a spurious "Samples line" in the output. The first 4 fields are featured here.
Sample.Index Sample.ID rs476542 rs7073746
1234567891_A 11 C C A G
1234567892_A 191 T C A G
1234567893_A 204 T C G G
1234567894_A 15 T C A G
1234567895_A 158 T T A A
1234567896_A 208 T C A A
1234567897_A 111 T T G G
1234567898_A 137 T C G G
1234567899_A 216 T C A G
1234567900_A 113 T C G G
1234567901_A 152 T C A G
1234567902_A 178 C C A A
1234567903_A 135 C C A A
1234567904_A 125 T C A A
1234567905_A 194 C C A A
1234567906_A 110 C C G G
1234567907_A 126 C C A A
Sample -
1234567908_A 169 C C G G
1234567909_A 173 C C G G
1234567910_A 168 T C A A
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
概念验证
Proof of Concept