使用 SED/AWK 将 FASTQ 转换为 FASTA

发布于 2024-08-07 09:53:22 字数 723 浏览 5 评论 0原文

我的数据总是以四个为一组采用以下格式（称为 FASTQ）：

@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

是否有一种简单的 sed/awk/bash 方法将它们转换为这种格式（称为 FASTA）：

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

原则上，我们希望提取每个 4 块中的前两行并将 @ 替换为 >。

原文

I have a data in that always comes in block of four
in the following format (called FASTQ):

@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into
this format (called FASTA):

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

In principle, we want to extract the first two lines in each block-of-4
and replace @ with >.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦初启 2024-08-14 09:53:22

这是一个老问题，并且已经提供了许多不同的解决方案。由于接受的答案使用 sed 但有一个明显的问题（即当 @ 符号作为质量行的第一个字母出现时，它将用 > 替换 @），我觉得有必要提供一个简单的基于 sed 的解决方案，该解决方案实际上有效：

sed -n '1~4s/^@/>/p;2~4p'

唯一的假设是每次读取恰好占用 FASTQ 文件中的 4 行，但根据我的经验，这似乎相当安全。

fastx 工具包中的 fastq_to_fasta 脚本也可以工作。（值得一提的是，您需要指定 -Q33 选项来适应现在常见的 Phred+33 质量编码。这很有趣，因为它无论如何都会丢弃质量数据！）

This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:

sed -n '1~4s/^@/>/p;2~4p'

The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)

回复收藏 0 原文

橘香 2024-08-14 09:53:22

sed 还没有死。如果我们打高尔夫球：

sed '/^@/!d;s//>/;N'

或者，模拟 http://www. Pierre 发布的ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl，它只打印第一行的第一个单词（id）并进行（一些）错误处理：

#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@\(\([^ ]*\).*\)\(\n[ACGTN]*\)\n+\1\n.*$/{
  # Output id and sequence for FASTA format.
  s//>\2\3/
  b
}
:error
i\
Error parsing input:
q

似乎有大量用于转换这些格式的现有工具；您可能应该使用这些而不是此处发布的任何内容（包括上述内容）。

sed ain't dead. If we're golfing:

sed '/^@/!d;s//>/;N'

Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:

#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@\(\([^ ]*\).*\)\(\n[ACGTN]*\)\n+\1\n.*$/{
  # Output id and sequence for FASTA format.
  s//>\2\3/
  b
}
:error
i\
Error parsing input:
q

There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).

回复收藏 0 原文

哀由 2024-08-14 09:53:22

正如 Cock 等人 (2009) NAR 中详细描述的，许多解决方案都是不正确的，因为“‘@’标记字符 (ASCII 64) 可能出现在质量字符串中的任何位置。这意味着任何解析器都不得处理以'@' 表示下一条记录的开始，无需额外检查迄今为止质量字符串的长度是否与序列的长度匹配。”

请参阅http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 了解详情。

回复收藏 0 原文

请爱~陌生人 2024-08-14 09:53:22

只需 awk ，不需要其他工具

# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

just awk , no need other tools

# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

回复收藏 0 原文

乙白 2024-08-14 09:53:22

请参阅 http://www 中的 fastq2fasta.pl .ringtail.tsl.ac.uk/david-studholme/scripts/

回复收藏 0 原文

别靠近我心 2024-08-14 09:53:22

我会写

awk '
    NR%4 == 1 {print ">" substr($0, 2)}
    NR%4 == 2 {print}
' fastq > fasta

I'd write

awk '
    NR%4 == 1 {print ">" substr($0, 2)}
    NR%4 == 2 {print}
' fastq > fasta

回复收藏 0 原文

西瓜 2024-08-14 09:53:22

这是我得到的最快的，我把它放在我的 .bashrc 文件中：

alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"

它不会在以 @... 开头的罕见但并非不可能的质量行上失败，但在包装的 FASTQ 上会失败，如果这甚至是合法的话（虽然它存在）。

This is the fastest I've got, and I stuck it in my .bashrc file:

alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"

It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).

回复收藏 0 原文

静若繁花 2024-08-14 09:53:22

您可能对 bioawk 感兴趣，它是 awk 的改编版本，专门用于处理 fasta 文件

bioawk -c fastx '{ print ">"$name ORS $seq }' file.fastq

注意： BioAwk 基于 Brian Kernighan 的 awk 记录在 “AWK 编程语言”中，
作者：Al Aho、Brian Kernighan 和 Peter Weinberger
（艾迪生-韦斯利，1988 年，ISBN 0-201-07981-X）
。我不确定此版本是否与 POSIX 兼容。

You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx '{ print ">"$name ORS $seq }' file.fastq

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.

回复收藏 0 原文

-小熊_ 2024-08-14 09:53:22

这是我刚刚从 SO 中学到的问题的“跳过每隔一行”部分的解决方案：

while read line
do
    # print two lines
    echo "$line"
    read line_to_print
    echo "$line_to_print"

    # and skip two lines
    read line_to_skip
    read line_to_skip
done

如果需要做的只是将一个 @ 更改为 > ，那么我认为

while read line
do
    echo "$line" | sed 's/@/>/'
    read line
    echo "$line"

    read line_to_skip
    read line_to_skip
done

可以完成这项工作。

Here's the solution to the "skip every other line" part of the problem that I just learned from SO:

while read line
do
    # print two lines
    echo "$line"
    read line_to_print
    echo "$line_to_print"

    # and skip two lines
    read line_to_skip
    read line_to_skip
done

If all that needs to be done is change one @ to >, then I reckon

while read line
do
    echo "$line" | sed 's/@/>/'
    read line
    echo "$line"

    read line_to_skip
    read line_to_skip
done

will do the job.

回复收藏 0 原文

遗心遗梦遗幸福 2024-08-14 09:53:22

像这样的东西：

awk 'BEGIN{a=0}{if(a==1){print;a=0}}/^@/{print;a=1}' myFastqFile | sed 's/^@/>/'

应该有效。

Something like:

awk 'BEGIN{a=0}{if(a==1){print;a=0}}/^@/{print;a=1}' myFastqFile | sed 's/^@/>/'

should work.

回复收藏 0 原文

小傻瓜 2024-08-14 09:53:22

我认为，使用 gnu grep 可以这样完成：

grep -A 1 "^@" t.txt | grep -v "^--" | sed -e "s/^@/\>/"

I think, with gnu grep this could be done with this:

grep -A 1 "^@" t.txt | grep -v "^--" | sed -e "s/^@/\>/"

回复收藏 0 原文

西瑶 2024-08-14 09:53:22

awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' data

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

下面的

awk '{gsub(/^[@]/,">"); print}' data

data 是您的数据文件。
我收到了：

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' data

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

below

awk '{gsub(/^[@]/,">"); print}' data

where data is your data file.
I've received:

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

回复收藏 0 原文

爱已欠费 2024-08-14 09:53:22

我知道我在未来，但为了谷歌用户的利益：

你可能想使用来自 fastx 工具包的 fastq_to_fasta。不过，它将保留 @ 符号。它还会删除带有 Ns 的行，除非您告诉它不要这样做。

回复收藏 0 原文

国粹 2024-08-14 09:53:22

不要重新发明轮子。对于常见的生物信息学任务，请使用专为这些任务设计、经过充分测试、广泛使用并处理边缘情况的开源工具。例如，对于常见的下一代测序数据处理任务，请使用seqtk。

seqtk seq -A in.fq > out.fa

要安装这些工具，请使用 conda，特别是 miniconda，例如：

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

参考文献：

seqtk：https://github.com/lh3/seqtk
conda：https ://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda 创建： https:// docs.conda.io/projects/conda/en/latest/commands/create.html

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, for common next generation sequencing data processing tasks, use seqtk.

seqtk seq -A in.fq > out.fa

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

References:

seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create: https://docs.conda.io/projects/conda/en/latest/commands/create.html

回复收藏 0 原文

~没有更多了~

关于作者

北城孤痞

暂无简介

0 文章

0 评论

22 人气

关注发私信

友情链接

文江博客

使用 SED/AWK 将 FASTQ 转换为 FASTA

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（14）

关于作者

相关话题

热门标签

推荐作者

1CH1MKgiKxn9p

ゞ记忆︶ㄣ

JackDx

信远

yaoduoduo1995

霞映澄塘

友情链接

使用 SED/AWK 将 FASTQ 转换为 FASTA

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（14）

关于作者

相关话题

热门标签

推荐作者

1CH1MKgiKxn9p

ゞ记忆︶ㄣ

JackDx

信远

yaoduoduo1995

霞映澄塘

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。