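Extracting data from a gb file using Biopython
I have a GenBank (.gb) file and need to extract some specific features from it: the names and sizes of the protein-coding genes.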
LOCUS NC_008137 15318 bp DNA linear MAM 15-APR-2009
DEFINITION Phalanger interpositus mitochondrion, complete genome.
ACCESSION NC_008137
VERSION NC_008137.1 GI:108793518
DBLINK Project: 17043
KEYWORDS .
SOURCE mitochondrion Phalanger interpositus (Stein's cuscus)
ORGANISM Phalanger interpositus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Metatheria; Diprotodontia; Phalangeridae; Phalanger.
REFERENCE 1 (bases 1 to 15318)
AUTHORS Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
Hasegawa,M.
TITLE Phylogenetic analysis of diprotodontian marsupials based on
complete mitochondrial genomes
JOURNAL Genes Genet. Syst. 81 (3), 181-191 (2006)
PUBMED 16905872
REFERENCE 2 (bases 1 to 15318)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (12-JUN-2006) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 15318)
AUTHORS Munemasa,M., Nikaido,M., Donnellan,S., Austin,C.C., Okada,N. and
Hasegawa,M.
TITLE Direct Submission
JOURNAL Submitted (08-NOV-2005) Tokyo Institute of Technology, Graduate
School of Bioscience and Biotechnology; Nagatsuta-cho 4259-B-21,
Midori-ku, Kanagawa 226-8501, Japan
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AB241057.
Genome sequence lacks part of non-coding region.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..15318
/organism="Phalanger interpositus"
/organelle="mitochondrion"
/mol_type="genomic DNA"
/db_xref="taxon:356347"
/tissue_type="liver"
/common="Stein's cuscus"
tRNA 1..69
/product="tRNA-Phe"
rRNA 72..1018
/product="s-rRNA"
/note="12S ribosomal RNA"
tRNA 1020..1088
/product="tRNA-Val"
rRNA 1089..2653
/product="l-rRNA"
/note="16S ribosomal RNA"
tRNA 2654..2727
/product="tRNA-Leu"
/codon_recognized="UUR"
gene 2729..3685
/gene="ND1"
/db_xref="GeneID:4117948"
CDS 2729..3685
/gene="ND1"
/codon_start=1
/transl_table=2
/product="NADH dehydrogenase subunit 1"
/protein_id="YP_637062.1"
/db_xref="GI:108793519"
/db_xref="GeneID:4117948"
/translation="MFIINLLMYIIPILLAIAFLTLVERKALGYMQFRKGPNVVGPYG
LLQPIADGMKLFSKEPLQPVTSSTTMFIIAPTLALTLSLTMWTPLPMPHSLIDLNLGL
LFILALSGLSVYSILWSGWASNSKYALMGALRAVAQTISYEVTLAIILLSIMLINGSF
TLKNLITTQENMWLIITTWPLVMMWYVSTLAETNRAPLDLTEGESELVSGFNVEYAAG
PFAMFFLAEYANIMLMNAMTTILFLGSSINHNFTHLNTLSFMTKTIALTFLFLWVRAS
YPRFRYDQLMHLLWKNFLPMTLAMCLWFISIPIALSCIPPQI"
misc_feature 2729..3682
/gene="ND1"
/note="NADH dehydrogenase; Region: NADHdh; cl00469"
/db_xref="CDD:186018"
tRNA 3686..3751
/product="tRNA-Ile"
tRNA complement(3750..3821)
/product="tRNA-Gln"
tRNA 3821..3878
/product="tRNA-Met"
gene 3889..4932
/gene="ND2"
/db_xref="GeneID:4117949"
CDS 3889..4932
/gene="ND2"
/codon_start=1
/transl_table=2
/product="NADH dehydrogenase subunit 2"
/protein_id="YP_637063.1"
/db_xref="GI:108793520"
/db_xref="GeneID:4117949"
/translation="MSPYILLIMLTSLLLGTSLTLFSNHWLTAWMGLEINTLAIIPMM
TYPNHPRATESAIKYFLTQSTASMMLMFAIINNAWMTNQWTLLQTSDQTSSTIMTLAL
AMKLGLAPFHFWVPEVTQGIPLTSGMILLTWQKIAPTSLMYQISPSLNMKILVMLALL
STILGGWGGLNQTHMRKILAYSSIAHMGWMTIIILINPTLTLLNLAIYITTTLTLFLA
LNHSSITKIKSLANLWNKSSSMTIVIALTLLSLGGLPPLTGFMPKWLILQELITYNNI
ATATMMAMSALLNLFFYMRIIYTTTLTMPPSINNSKLQWPHPQTKTTNIIPLLTIISS
FLLPLTPLSITLS"
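I tried using SeqFeature and subfeatures, but it did not work.
From this file I should get ND1 with 2729..3685, ND2 with 3889..4932, and so on for any further genes.
I'm new to Biopython and would appreciate help with how to do this.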
Comments (1)
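The GenBank file you posted is incomplete: some sections are missing and it lacks the "//" termination line, so parsers get stuck trying to read it. I got the complete file for the Phalanger interpositus mitochondrion from http://www.ncbi.nlm.nih.gov/nuccore/NC_008137.1.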
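A quick way to check for the missing terminator before parsing, as a minimal sketch. The helper name `has_genbank_terminator` is an assumption for illustration, and it only inspects the last non-blank line; it does not verify that the rest of the record is complete.

```python
def has_genbank_terminator(text):
    """Return True if the last non-blank line of the text is the '//' record terminator."""
    # Drop blank lines and surrounding whitespace so trailing newlines do not matter.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "//"


# Example use (hypothetical filename):
# with open("NC_008137.gb") as handle:
#     print(has_genbank_terminator(handle.read()))
```

A record pasted without its final "//" line, like the one in the question, would make this return False.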
Then (Python 3 code):
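The answer's own code is not reproduced here. What follows is a minimal sketch of one way to do it, not the original author's code. It assumes Biopython is installed and that the complete record has been saved locally; the filename `NC_008137.gb` and the function names `cds_names_and_locations`, `read_cds` and `format_cds` are assumptions for illustration. It walks the record's features, keeps only those of type CDS, and reports each gene name with its 1-based coordinates and length.

```python
def cds_names_and_locations(record):
    """Yield (gene, start, end, length) for every CDS feature of a record.

    `record` is expected to behave like a Biopython SeqRecord: it has a
    `features` list whose items carry `type`, `qualifiers` and `location`.
    """
    for feature in record.features:
        if feature.type != "CDS":
            continue  # skip gene, tRNA, rRNA, misc_feature, source, ...
        # Qualifier values are lists; use a placeholder if /gene is absent.
        name = feature.qualifiers.get("gene", ["unknown"])[0]
        # Biopython positions are 0-based and end-exclusive, so add 1 to the
        # start to recover the coordinate printed in the GenBank file.
        start = int(feature.location.start) + 1
        end = int(feature.location.end)
        yield name, start, end, len(feature.location)


def read_cds(path):
    """Parse a single-record GenBank file and list its CDS features."""
    # Imported here so the helper above also works on already-parsed records.
    from Bio import SeqIO

    record = SeqIO.read(path, "genbank")
    return list(cds_names_and_locations(record))


def format_cds(name, start, end, size):
    """Format one row in the 'ND1 2729..3685' style used in the question."""
    return "{} {}..{} ({} bp)".format(name, start, end, size)


# Example use (hypothetical filename):
# for row in read_cds("NC_008137.gb"):
#     print(format_cds(*row))
```

Filtering on the CDS feature type rather than on gene features keeps the result to protein-coding genes. `SeqIO.read` expects exactly one record in the file; for a file holding several records, `SeqIO.parse` would be the call to use instead.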