How do I loop over multiple folders to concatenate FastQ files?
I have received multiple fastq.gz files from Illumina sequencing for 100 samples. The fastq.gz files for each sample are in a separate folder named after the sample ID. Moreover, each sample has multiple (8-16) R1.fastq.gz and R2.fastq.gz files. So I used the following command to concatenate all the R1.fastq.gz files into a single R1.fastq.gz, and likewise for R2.fastq.gz:
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
So the sequencing file names are structured as in the command above. For each sample, the string starting with V has a different number, followed by L with a different number, and then another string of digits before the _1 and _2. The numbers change from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the differing file numbering into account, and concatenates the multiple fq.gz files of each sample into a single R1 file and a single R2 file?
Surely I cannot concatenate them one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
Comments (1)
Based on the provided file structure, would you please try:
for d in Raw2/C*/ loops over the subdirectories starting with C.
cd "$d" changes into each subdirectory in turn (at the expense of small extra execution time).
id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, will be expanded as V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fastq.gz. The same applies to ${id}_R2.fastq.gz.
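The loop described in this answer can be sketched as follows. This is a reconstruction from the fragments quoted in the answer, not the answerer's exact script. It assumes the layout from the question (sample folders named Sample_* under one base directory, where the answer used Raw2/C*/) and chunk files ending in _1.fq.gz and _2.fq.gz. The function name merge_fastq is hypothetical, and the glob patterns would need adjusting to the real file names.

```shell
#!/usr/bin/env bash
# Sketch: merge the per-sample FASTQ chunks into one R1 and one R2 file.
# Assumptions (not from the original post): sample folders match
# "$base"/Sample_*/ and chunk files end in _1.fq.gz / _2.fq.gz.

merge_fastq() {
    local base=$1 d id r
    for d in "$base"/Sample_*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        id=$(basename "$d")              # e.g. Sample_1
        (
            # The subshell keeps the cd local to this iteration.
            cd "$d" || exit 1
            for r in 1 2; do
                # The glob expands in sorted order, so R1 and R2 chunks
                # are concatenated in the same order.
                set -- ./*_"$r".fq.gz
                [ -e "$1" ] || continue  # no chunks for this read
                cat "$@" > "${id}_R${r}.fastq.gz"
            done
        )
    done
}

# Run only when a base directory is given, e.g.: bash merge.sh /data
if [ "$#" -gt 0 ]; then
    merge_fastq "$1"
fi
```

Concatenating gzip files with cat gives a valid gzip file, so no decompression is needed. The outputs end in .fastq.gz, which the *_1.fq.gz pattern does not match, so rerunning the script does not feed a merged file back into itself. As a quick sanity check, the line counts of gzip -dc on the merged R1 and R2 files of a sample should be equal.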