How do I loop over multiple folders to concatenate FastQ files?

Posted 2025-01-09 15:04:22

I have received multiple fastq.gz files from Illumina Sequencing for 100 samples. But all the fastq.gz files for the respective samples are in separate folders according to the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for one sample. So, I used the following code for concatenating all the R1.fastq.gz and R2.fastq.gz into a single R1.fastq.gz and R2.fastq.gz.

cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz

The sequencing file names follow the pattern shown in the command above. For each sample, the string starting with V has a different number, the L part has a different number, and there is another string of digits before the _1 and _2 suffixes. These numbers change from sample to sample.
My question is: how can I write a loop that goes over all the folders at once, copes with the varying file numbering, and concatenates the multiple fq.gz files of each sample into a single R1 file and a single R2 file?
Going into each sample folder and concatenating the files by hand is not practical for 100 samples.

Please give some helpful tips. Thank you.
The folder structure is the following:

/data/Sample_1/....._525_1_fq.gz    /....._525_2_fq.gz    /....._526_1_fq.gz        /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz    /....._580_2_fq.gz    /....._589_1_fq.gz        /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz    /....._690_2_fq.gz    /....._645_1_fq.gz        /....._645_2_fq.gz

Below I have attached a screenshot of the folder structure.

[Screenshot: folder structure]

Answer by 放赐, 2025-01-16 15:04:22

Based on the provided file structure, would you please try:

#!/bin/bash

for d in Raw2/C*/; do
(
    cd "$d" || exit                     # leave this subshell if the directory cannot be entered
    id=${d%/}; id=${id##*/}             # extract ID from the directory name
    cat V*_1.fq.gz > "${id}_R1.fq.gz"
    cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done

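The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
The parentheses run the inner commands in a subshell, so the directory change made by cd "$d" is discarded when the subshell exits and there is no need to cd back (at the cost of a small amount of extra execution time).
The variable id is set to the sample ID extracted from the directory name: ${d%/} strips the trailing slash and ${id##*/} strips everything up to the last remaining slash.
For example, cat V*_1.fq.gz expands to cat V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and the output is written to ${id}_R1.fq.gz. The same applies to ${id}_R2.fq.gz. Concatenating gzip files with cat is valid: the result is a multi-member gzip stream that decompresses to the concatenated contents. Because the shell expands globs in sorted order, the _1 and _2 files are concatenated in the same order as long as paired file names differ only in that suffix, which keeps the read pairs aligned.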
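If the layout is closer to the /data/Sample_N/ structure described in the question, a slightly more defensive variant might look like the sketch below. It is not part of the original answer: the function name merge_fastq, the separate output directory, and the *_1.fq.gz / *_2.fq.gz file patterns are assumptions made for illustration and should be adjusted to the real file names. The sketch skips any sample folder where the _1 and _2 files are missing or differ in number, and it leaves the input folders untouched.

```shell
#!/bin/bash
# Sketch only: merge per-sample FASTQ chunks into one R1 and one R2 file.
# Assumed (hypothetical) layout: <root>/<sample>/*_1.fq.gz and *_2.fq.gz
# Note: the helper variables below are not local, to stay POSIX-compatible.

merge_fastq() {
    root=$1      # directory containing one subfolder per sample
    outdir=$2    # where the merged files are written (ideally outside root)
    mkdir -p "$outdir" || return 1

    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        id=${d%/}; id=${id##*/}          # sample ID = folder name

        # Count the files matched by each pattern; an unmatched glob stays
        # literal, so also check that the first match really exists.
        set -- "$d"*_1.fq.gz; n1=$#; first1=$1
        set -- "$d"*_2.fq.gz; n2=$#; first2=$1
        if [ ! -e "$first1" ] || [ ! -e "$first2" ] || [ "$n1" -ne "$n2" ]; then
            echo "skipping $id: missing or unpaired _1/_2 files" >&2
            continue
        fi

        # Globs expand in sorted order, so R1 and R2 chunks line up.
        cat "$d"*_1.fq.gz > "$outdir/${id}_R1.fq.gz"
        cat "$d"*_2.fq.gz > "$outdir/${id}_R2.fq.gz"
    done
    return 0
}

# Run only when two arguments are given, e.g.: ./merge_fastq.sh /data /merged
if [ "$#" -eq 2 ]; then
    merge_fastq "$1" "$2"
fi
```

Writing the merged files to a separate directory means a second run cannot pick up earlier output by accident, and the warning on standard error makes it easy to spot samples that need a manual look.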