使用 SED/AWK 将 FASTQ 转换为 FASTA
我的数据总是以四个为一组 采用以下格式(称为 FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
是否有一种简单的 sed/awk/bash 方法将它们转换为 这种格式(称为 FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
原则上,我们希望提取每个 4 块中的前两行 并将 @
替换为 >
。
I have a data in that always comes in block of four
in the following format (called FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
Is there a simple sed/awk/bash way to convert them into
this format (called FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
In principle, we want to extract the first two lines in each block-of-4
and replace @
with >
.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
这是一个老问题,并且已经提供了许多不同的解决方案。由于接受的答案使用 sed 但有一个明显的问题(即当 @ 符号作为质量行的第一个字母出现时,它将用 > 替换 @),我觉得有必要提供一个简单的基于 sed 的解决方案,该解决方案实际上有效:
唯一的假设是每次读取恰好占用 FASTQ 文件中的 4 行,但根据我的经验,这似乎相当安全。
fastx 工具包中的 fastq_to_fasta 脚本也可以工作。 (值得一提的是,您需要指定 -Q33 选项来适应现在常见的 Phred+33 质量编码。这很有趣,因为它无论如何都会丢弃质量数据!)
This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:
The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.
The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)
sed 还没有死。如果我们打高尔夫球:
或者,模拟 http://www. Pierre 发布的ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl,它只打印第一行的第一个单词(id)并进行(一些)错误处理:
似乎有大量用于转换这些格式的现有工具;您可能应该使用这些而不是此处发布的任何内容(包括上述内容)。
sed ain't dead. If we're golfing:
Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:
There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).
正如 Cock 等人 (2009) NAR 中详细描述的,许多解决方案都是不正确的,因为“‘@’标记字符 (ASCII 64) 可能出现在质量字符串中的任何位置。这意味着任何解析器都不得处理以'@' 表示下一条记录的开始,无需额外检查迄今为止质量字符串的长度是否与序列的长度匹配。”
请参阅http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 了解详情。
As detailed in Cock, et al (2009) NAR, many of these solutions are incorrect since "the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence."
See http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 for details.
只需 awk ,不需要其他工具
just awk , no need other tools
请参阅 http://www 中的 fastq2fasta.pl .ringtail.tsl.ac.uk/david-studholme/scripts/
See fastq2fasta.pl in http://www.ringtail.tsl.ac.uk/david-studholme/scripts/
我会写
I'd write
这是我得到的最快的,我把它放在我的 .bashrc 文件中:
它不会在以 @... 开头的罕见但并非不可能的质量行上失败,但在包装的 FASTQ 上会失败,如果这甚至是合法的话(虽然它存在)。
This is the fastest I've got, and I stuck it in my .bashrc file:
It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).
您可能对 bioawk 感兴趣,它是 awk 的改编版本,专门用于处理 fasta 文件
注意: BioAwk 基于 Brian Kernighan 的 awk 记录在 “AWK 编程语言”中,
作者:Al Aho、Brian Kernighan 和 Peter Weinberger
(艾迪生-韦斯利,1988 年,ISBN 0-201-07981-X)
。我不确定此版本是否与 POSIX 兼容。
You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.
这是我刚刚从 SO 中学到的问题的“跳过每隔一行”部分的解决方案:
如果需要做的只是将一个
@
更改为>
,那么我认为可以完成这项工作。
Here's the solution to the "skip every other line" part of the problem that I just learned from SO:
If all that needs to be done is change one
@
to>
, then I reckonwill do the job.
像这样的东西:
应该有效。
Something like:
should work.
我认为,使用 gnu grep 可以这样完成:
I think, with gnu grep this could be done with this:
下面的
data 是您的数据文件。
我收到了:
below
where data is your data file.
I've received:
我知道我在未来,但为了谷歌用户的利益:
你可能想使用 来自 fastx 工具包的 fastq_to_fasta。不过,它将保留 @ 符号。它还会删除带有 Ns 的行,除非您告诉它不要这样做。
I know I'm way in the future, but for the benefit of googlers:
You may want to use fastq_to_fasta from the fastx toolkit. It will keep the @ sign, though. It will also remove lines with Ns unless you tell it not to.
不要重新发明轮子。对于常见的生物信息学任务,请使用专为这些任务设计、经过充分测试、广泛使用并处理边缘情况的开源工具。例如,对于常见的下一代测序数据处理任务,请使用
seqtk
。要安装这些工具,请使用
conda
,特别是miniconda
,例如:参考文献:
seqtk
:https://github.com/lh3/seqtkconda
:https ://docs.conda.io/projects/conda/en/latest/user-guide/install/index.htmlconda 创建
: https:// docs.conda.io/projects/conda/en/latest/commands/create.htmlDo not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, for common next generation sequencing data processing tasks, use
seqtk
.To install these tools, use
conda
, specificallyminiconda
, for example:References:
seqtk
: https://github.com/lh3/seqtkconda
: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.htmlconda create
: https://docs.conda.io/projects/conda/en/latest/commands/create.html