Converting a GenBank flat file to FASTA
I need to parse a preliminary GenBank flat file. The sequence hasn't been published yet, so I can't look it up by accession number and download a FASTA file. I'm new to bioinformatics, so could someone show me where I could find a BioPerl or Biopython script to do this myself? Thanks!
Comments (2)
You need the Bio::SeqIO module to read or write bioinformatics data. The SeqIO HOWTO should tell you everything you need to know, but here's a small Perl script that reads a GenBank file to get you started!
Here is a Biopython solution. I will first assume your GenBank file holds a genome sequence, then give a different solution for the case where it holds a gene sequence instead. It would have helped to know which of the two you are dealing with.
Genome Sequence Parsing:
Parse your custom GenBank flat file from disk with:
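A minimal sketch of this step, assuming the flat file holds exactly one record. The helper name `load_genbank_record` and the path `my_sequence.gb` are illustrative and not part of Biopython:

```python
from Bio import SeqIO


def load_genbank_record(source):
    """Read exactly one GenBank record from a file path or an open text handle.

    SeqIO.read raises ValueError if the input holds zero or several records;
    for multi-record files, iterate over SeqIO.parse(source, "genbank") instead.
    """
    return SeqIO.read(source, "genbank")


# Hypothetical usage; replace the path with your own flat file:
# gb_record = load_genbank_record("my_sequence.gb")
```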
If you just want the raw sequence, then:
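One way to do this, sketched under the assumption of a single-record file, is to convert the record's `seq` attribute to a plain string. `raw_sequence` is an illustrative helper name:

```python
from Bio import SeqIO


def raw_sequence(source):
    """Return the record's sequence as a plain Python string, without annotation."""
    record = SeqIO.read(source, "genbank")
    # record.seq is a Bio.Seq.Seq object; str() yields the bare letters.
    return str(record.seq)
```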
Now perhaps you need a name for this sequence, to give it a ">header" line before writing the .fasta file. Let's see what names came with the GenBank .gb file:
This should return a dictionary with various synonyms for the whole sequence, as annotated by the author of that GenBank file.
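As a sketch of where such names live in Biopython, the parsed record exposes `id`, `name`, `description` and an `annotations` dictionary. Which of these the dictionary above refers to is an assumption here, and the helper names `describe_record` and `genbank_to_fasta` are illustrative:

```python
from Bio import SeqIO


def describe_record(record):
    """Collect the naming fields Biopython fills in from the GenBank header."""
    return {
        "id": record.id,                    # usually the accession, with its version if given
        "name": record.name,                # taken from the LOCUS line
        "description": record.description,  # taken from the DEFINITION line
        # Remaining header fields that were present, such as organism or date.
        "annotations": dict(record.annotations),
    }


def genbank_to_fasta(source, destination):
    """Write every record in a GenBank file to FASTA and return the record count.

    Each FASTA header is built from the record's id and description.
    """
    return SeqIO.convert(source, "genbank", destination, "fasta")
```

Both functions accept a file path or an open handle, so `genbank_to_fasta("my_sequence.gb", "my_sequence.fasta")` would cover the whole-sequence case, with both file names being placeholders.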
Gene Sequence Parsing:
Parse your custom GenBank flat file from disk with:
To get a list of raw sequences for the gene, or for all genes:
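A minimal sketch, assuming the genes are annotated as `gene` features in the FEATURES table of the file. `gene_sequences` and its `feature_type` parameter are illustrative names:

```python
from Bio import SeqIO


def gene_sequences(source, feature_type="gene"):
    """Return the sequence of every feature of the given type, as strings.

    feature.extract() slices the parent sequence and reverse-complements
    features annotated on the minus strand.
    """
    sequences = []
    for record in SeqIO.parse(source, "genbank"):
        for feature in record.features:
            if feature.type == feature_type:
                sequences.append(str(feature.extract(record.seq)))
    return sequences
```

Passing `feature_type="CDS"` would collect coding sequences instead, provided the file annotates CDS features.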
To get a list of names for each gene sequence (more precisely, a dictionary of synonyms for each gene):
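In Biopython the per-gene names sit in each feature's `qualifiers` dictionary, so a sketch of this step could look as follows. It assumes `gene` features again, and `gene_names` is an illustrative helper name:

```python
from Bio import SeqIO


def gene_names(source, feature_type="gene"):
    """Return one qualifier dictionary per feature of the given type.

    Each value is a list, because GenBank allows a qualifier to repeat.
    """
    names = []
    for record in SeqIO.parse(source, "genbank"):
        for feature in record.features:
            if feature.type == feature_type:
                names.append(dict(feature.qualifiers))
    return names
```

Typical keys are `gene`, `locus_tag` or `product`, depending on what the author of the file annotated. Any of them can serve as the ">header" text when writing each gene to FASTA.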