定义计算氨基酸相对频率的函数

发布于 2024-11-03 07:22:26 字数 486 浏览 1 评论 0原文

我正在尝试计算给定 DNA 序列内的密码子频率。

例如:

sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']

对于密码子中的 XX:

frequency  = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))

请注意,XX2 和 XX3 并不总是在序列中。一些密码子可能有也可能没有多个密码子。

示例:赖氨酸有 2 个密码子,AAA 和 AAG,

因此频率

AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))

如何对列表中的每个密码子执行此操作?我如何解释多个密码子?

I'm trying to calculate codon frequency within a given sequence of DNA.

For example:

sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']

for XX in codons:

frequency  = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))

Note that XX2 and XX3 will not always be in the sequence. Some codons may or may not have multiple codons.

Example: Lysine has 2 codons, AAA and AAG

so the frequency of

AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))

How can I do this for EVERY codon in the list? How do I account for multiple codons?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

这样的小城市 2024-11-10 07:22:26

使用defaultdict

from collections import defaultdict

mydict = defaultdict(int)

for aa in mysecuence:
    mydict[aa] +=1

这适用于氨基酸(蛋白质)。
对于密码子,您应该以 3 个位置的步骤迭代序列以获取 defaultdict 的键。例如:

>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a


['GAU', 'CAC', 'TUG', 'CCA']

编辑:如果你想计算简并,你应该准备一个字典,将每个密码子(键)与其简并密码子(值,密码子列表)相关联。为了计算频率,
从 defaultdict 中,您可以获得每个密码子的计数,然后对于每个密码子,您可以计算从上述密码子字典中读取的简并密码子的计数总和。然后就可以计算频率了。

编辑2:这里有一个真实的例子:

from collections import defaultdict

#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

seq = rna.upper().replace('T', 'U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
               ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
               ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
               ('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
               ('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
               ('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
               ('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
               ('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
               ('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
               ('UAA', 'UGA', 'UAG'))

#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
    for codon in codons:
        degen_dict[codon] = codons

#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]

#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
    counts[codon] +=1

#actual calculation of frecuencies
data = {}
for codon in query_codons:
    if codon in  degen_dict:
        totals = sum(counts[deg] for deg in degen_dict[codon])
        frecuency = float(counts[codon]) / totals
    else:
        frecuency = 1.00

    data[codon] = frecuency

#print results
for codon, frecuency in data.iteritems():
    print "%s  -> %.2f" %(codon, frecuency)


#produces:
GUC  -> 0.57
AUA  -> 1.00
ACG  -> 0.50
AAC  -> 1.00
CCU  -> 0.25
UAU  -> 1.00
..........
GCU  -> 0.19
GAU  -> 1.00
UAG  -> 0.33
CUC  -> 0.38
UUA  -> 0.13
UGA  -> 0.33

use defaultdict

from collections import defaultdict

mydict = defaultdict(int)

for aa in mysecuence:
    mydict[aa] +=1

This works for aminoacids (proteins).
For codons, you should iterate on the sequence in 3 positions steps to get the keys of the defaultdict. For example:

>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a


['GAU', 'CAC', 'TUG', 'CCA']

EDIT: If you want to calculate degeneration, you should prepare a dictionary relating each codon (key) with its degenerated codons (value, list of codons). To calculate the frecuency,
from the defaultdict you can get the counts for each codon, then for each codon you calculate the sum of the counts of the degenerated codons read from the dictionary of codons indicated above. Then you can calculate the frecuency.

EDIT 2: Here you have a real example:

from collections import defaultdict

#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

seq = rna.upper().replace('T', 'U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
               ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
               ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
               ('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
               ('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
               ('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
               ('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
               ('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
               ('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
               ('UAA', 'UGA', 'UAG'))

#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
    for codon in codons:
        degen_dict[codon] = codons

#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]

#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
    counts[codon] +=1

#actual calculation of frecuencies
data = {}
for codon in query_codons:
    if codon in  degen_dict:
        totals = sum(counts[deg] for deg in degen_dict[codon])
        frecuency = float(counts[codon]) / totals
    else:
        frecuency = 1.00

    data[codon] = frecuency

#print results
for codon, frecuency in data.iteritems():
    print "%s  -> %.2f" %(codon, frecuency)


#produces:
GUC  -> 0.57
AUA  -> 1.00
ACG  -> 0.50
AAC  -> 1.00
CCU  -> 0.25
UAU  -> 1.00
..........
GCU  -> 0.19
GAU  -> 1.00
UAG  -> 0.33
CUC  -> 0.38
UUA  -> 0.13
UGA  -> 0.33
怀中猫帐中妖 2024-11-10 07:22:26

如果您的序列位于正确的阅读框架中:

>>> import collections
>>> 
>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> def count_codons(cds):
...     counts = collections.defaultdict(int)
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        counts[codon] += 1
...     return counts
... 
>>> def translate(cds, code):
...     return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
... 
>>> seq = 'ATGAAGAAA'
>>> 
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>> 
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>> 

其他信息:

我认为您要求找到一种称为密码子使用的东西。

有一些在线工具可以让您查找密码子的使用情况。该功能还允许离线使用。

http://www.bioinformatics.org/sms2/codon_usage.html

和结果(在这个“分数”就是您所要求的):

Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid   Codon     Number        /1000     Fraction   .. 

Ala      GCG         0.00         0.00         0.00 
Ala      GCA         0.00         0.00         0.00 
Ala      GCT         0.00         0.00         0.00 
Ala      GCC         0.00         0.00         0.00 

Cys      TGT         0.00         0.00         0.00 
Cys      TGC         0.00         0.00         0.00 

Asp      GAT         0.00         0.00         0.00 
Asp      GAC         0.00         0.00         0.00 

Glu      GAG         0.00         0.00         0.00 
Glu      GAA         0.00         0.00         0.00 

Phe      TTT         0.00         0.00         0.00 
Phe      TTC         0.00         0.00         0.00 

Gly      GGG         0.00         0.00         0.00 
Gly      GGA         0.00         0.00         0.00 
Gly      GGT         0.00         0.00         0.00 
Gly      GGC         0.00         0.00         0.00 

His      CAT         0.00         0.00         0.00 
His      CAC         0.00         0.00         0.00 

Ile      ATA         0.00         0.00         0.00 
Ile      ATT         0.00         0.00         0.00 
Ile      ATC         0.00         0.00         0.00 

Lys      AAG         1.00       333.33         0.50 
Lys      AAA         1.00       333.33         0.50 

Leu      TTG         0.00         0.00         0.00 
Leu      TTA         0.00         0.00         0.00 
Leu      CTG         0.00         0.00         0.00 
Leu      CTA         0.00         0.00         0.00 
Leu      CTT         0.00         0.00         0.00 
Leu      CTC         0.00         0.00         0.00 

Met      ATG         1.00       333.33         1.00 

Asn      AAT         0.00         0.00         0.00 
Asn      AAC         0.00         0.00         0.00 

Pro      CCG         0.00         0.00         0.00 
Pro      CCA         0.00         0.00         0.00 
Pro      CCT         0.00         0.00         0.00 
Pro      CCC         0.00         0.00         0.00 

Gln      CAG         0.00         0.00         0.00 
Gln      CAA         0.00         0.00         0.00 

Arg      AGG         0.00         0.00         0.00 
Arg      AGA         0.00         0.00         0.00 
Arg      CGG         0.00         0.00         0.00 
Arg      CGA         0.00         0.00         0.00 
Arg      CGT         0.00         0.00         0.00 
Arg      CGC         0.00         0.00         0.00 

Ser      AGT         0.00         0.00         0.00 
Ser      AGC         0.00         0.00         0.00 
Ser      TCG         0.00         0.00         0.00 
Ser      TCA         0.00         0.00         0.00 
Ser      TCT         0.00         0.00         0.00 
Ser      TCC         0.00         0.00         0.00 

Thr      ACG         0.00         0.00         0.00 
Thr      ACA         0.00         0.00         0.00 
Thr      ACT         0.00         0.00         0.00 
Thr      ACC         0.00         0.00         0.00 

Val      GTG         0.00         0.00         0.00 
Val      GTA         0.00         0.00         0.00 
Val      GTT         0.00         0.00         0.00 
Val      GTC         0.00         0.00         0.00 

Trp      TGG         0.00         0.00         0.00 

Tyr      TAT         0.00         0.00         0.00 
Tyr      TAC         0.00         0.00         0.00 

End      TGA         0.00         0.00         0.00 
End      TAG         0.00         0.00         0.00 
End      TAA         0.00         0.00         0.00 

cusp 是 EMBOSS 的密码子使用工具,也可能值得一看。

您可能想查看 BioPython 来处理生物序列。我相信他们有密码子使用模块。

If your sequence is in the correct reading frame:

>>> import collections
>>> 
>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> def count_codons(cds):
...     counts = collections.defaultdict(int)
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        counts[codon] += 1
...     return counts
... 
>>> def translate(cds, code):
...     return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
... 
>>> seq = 'ATGAAGAAA'
>>> 
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>> 
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>> 

other info:

I think you are asking to find something called codon usage.

There are tools online which allow you to find codon usage. This one also allows for offline use.

http://www.bioinformatics.org/sms2/codon_usage.html

and results (in this 'Fraction' is what you are asking for):

Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid   Codon     Number        /1000     Fraction   .. 

Ala      GCG         0.00         0.00         0.00 
Ala      GCA         0.00         0.00         0.00 
Ala      GCT         0.00         0.00         0.00 
Ala      GCC         0.00         0.00         0.00 

Cys      TGT         0.00         0.00         0.00 
Cys      TGC         0.00         0.00         0.00 

Asp      GAT         0.00         0.00         0.00 
Asp      GAC         0.00         0.00         0.00 

Glu      GAG         0.00         0.00         0.00 
Glu      GAA         0.00         0.00         0.00 

Phe      TTT         0.00         0.00         0.00 
Phe      TTC         0.00         0.00         0.00 

Gly      GGG         0.00         0.00         0.00 
Gly      GGA         0.00         0.00         0.00 
Gly      GGT         0.00         0.00         0.00 
Gly      GGC         0.00         0.00         0.00 

His      CAT         0.00         0.00         0.00 
His      CAC         0.00         0.00         0.00 

Ile      ATA         0.00         0.00         0.00 
Ile      ATT         0.00         0.00         0.00 
Ile      ATC         0.00         0.00         0.00 

Lys      AAG         1.00       333.33         0.50 
Lys      AAA         1.00       333.33         0.50 

Leu      TTG         0.00         0.00         0.00 
Leu      TTA         0.00         0.00         0.00 
Leu      CTG         0.00         0.00         0.00 
Leu      CTA         0.00         0.00         0.00 
Leu      CTT         0.00         0.00         0.00 
Leu      CTC         0.00         0.00         0.00 

Met      ATG         1.00       333.33         1.00 

Asn      AAT         0.00         0.00         0.00 
Asn      AAC         0.00         0.00         0.00 

Pro      CCG         0.00         0.00         0.00 
Pro      CCA         0.00         0.00         0.00 
Pro      CCT         0.00         0.00         0.00 
Pro      CCC         0.00         0.00         0.00 

Gln      CAG         0.00         0.00         0.00 
Gln      CAA         0.00         0.00         0.00 

Arg      AGG         0.00         0.00         0.00 
Arg      AGA         0.00         0.00         0.00 
Arg      CGG         0.00         0.00         0.00 
Arg      CGA         0.00         0.00         0.00 
Arg      CGT         0.00         0.00         0.00 
Arg      CGC         0.00         0.00         0.00 

Ser      AGT         0.00         0.00         0.00 
Ser      AGC         0.00         0.00         0.00 
Ser      TCG         0.00         0.00         0.00 
Ser      TCA         0.00         0.00         0.00 
Ser      TCT         0.00         0.00         0.00 
Ser      TCC         0.00         0.00         0.00 

Thr      ACG         0.00         0.00         0.00 
Thr      ACA         0.00         0.00         0.00 
Thr      ACT         0.00         0.00         0.00 
Thr      ACC         0.00         0.00         0.00 

Val      GTG         0.00         0.00         0.00 
Val      GTA         0.00         0.00         0.00 
Val      GTT         0.00         0.00         0.00 
Val      GTC         0.00         0.00         0.00 

Trp      TGG         0.00         0.00         0.00 

Tyr      TAT         0.00         0.00         0.00 
Tyr      TAC         0.00         0.00         0.00 

End      TGA         0.00         0.00         0.00 
End      TAG         0.00         0.00         0.00 
End      TAA         0.00         0.00         0.00 

cusp is the codon usage tool from EMBOSS which also may be worth taking a look at.

You may want to checkout BioPython for working with biological sequences. I believe they have a codon usage module.

自由如风 2024-11-10 07:22:26

PLY 是一个解析器模块,具有一些很好的调试功能;它非常擅长这样的任务...

from ply import lex

tokens = (
    'CODON',
)
t_CODON = (
    r"ATG|"
    r"AAG|"
    r"AAF|"
    r"AAC|"
    r"BOB|"
    r"FOO|"
    r"BAR|"
    r"AAA"
)
def t_error(t):
    raise TypeError("Unknown codon '%s'" % (t.value,))
lex.lex()
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
ccount = dict()
total = 0.0
lex.input(sequence)
for tok in iter(lex.token, None):
    if ccount.get(tok.value, False):
        ccount[tok.value] += 1
    else:
        ccount[tok.value] = 1
    total += 1.0

for codon,count in ccount.items():
    print "Frequency of %s is %f" % (codon, count/total)

运行该代码会产生...

[mpenning@Bucksnort ~]$ python codon.py
Frequency of BAR is 0.111111
Frequency of BOB is 0.111111
Frequency of FOO is 0.111111
Frequency of AAA is 0.444444
Frequency of AAC is 0.222222

当您开始介绍化学术语时我有点迷失,但您可能可以从这里接管...

PLY is a parser module that has some nice debugging features; it is very good at tasks like this...

from ply import lex

tokens = (
    'CODON',
)
t_CODON = (
    r"ATG|"
    r"AAG|"
    r"AAF|"
    r"AAC|"
    r"BOB|"
    r"FOO|"
    r"BAR|"
    r"AAA"
)
def t_error(t):
    raise TypeError("Unknown codon '%s'" % (t.value,))
lex.lex()
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
ccount = dict()
total = 0.0
lex.input(sequence)
for tok in iter(lex.token, None):
    if ccount.get(tok.value, False):
        ccount[tok.value] += 1
    else:
        ccount[tok.value] = 1
    total += 1.0

for codon,count in ccount.items():
    print "Frequency of %s is %f" % (codon, count/total)

Running that code produces...

[mpenning@Bucksnort ~]$ python codon.py
Frequency of BAR is 0.111111
Frequency of BOB is 0.111111
Frequency of FOO is 0.111111
Frequency of AAA is 0.444444
Frequency of AAC is 0.222222

I'm a little lost when you start introducing the chemical terminology, but you can probably take over from here...

好久不见√ 2024-11-10 07:22:26
  • 包含所有 64 个密码子的密码子表,甚至包括非简并密码子(它们构成一个元素组)

  • 在迭代期间计算密码子出现的同时计算每个密码子组的出现次数

  • 包含编码氨基酸名称的密码子表->一个良好的显示

代码:

from collections import defaultdict

# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

arn = adn.upper().replace('T','U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'),  'Alanine'),
               (('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),  'Leucine'),
               (('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),  'Arginine'),
               (('AAA', 'AAG'),  'Lysine'),
               (('AAU', 'AAC'),  'Asparagine'),
               (('AUG',),  'Methionine'),
               (('GAU', 'GAC'),  'Aspartic acid' ),              
               (('UUU', 'UUC'),  'Phenylalanine'),
               (('UGU', 'UGC'),  'Cysteine'),
               (('CCU', 'CCC', 'CCA', 'CCG'),  'Proline') ,
               (('CAA', 'CAG'),  'Glutamine'),
               (('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),  'Serine'),
               (('GAA', 'GAG'),  'Glutamic acid'),
               (('ACU', 'ACC', 'ACA', 'ACG'),  'Threonine'),
               (('GGU', 'GGC', 'GGA', 'GGG'),  'Glycine'),
               (('UGG',),  'Tryptophane'),
               (('CAU', 'CAC'),  'Histidine'),
               (('UAU', 'UAC'),  'Tyrosine'),
               (('AUU', 'AUC', 'AUA'),  'Isoleucine'),
               (('GUU', 'GUC', 'GUA', 'GUG'),  'Valine'),
               (('UAA', 'UGA', 'UAG'),  'STOP')            )

siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )

cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}

for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
    cod_count[cod] += 1
    grp_count[siblings[cod]] += 1

for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
    if siblings[cod] in grp_count:
        freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
    else:
        freq[cod] = '-* Missing *-'


display = '\n'.join(aa.rjust(13)+\
                '\n'.join('%s  %-16s' % (cod.rjust(18 if i else 5),freq[cod])
                          for i,cod in enumerate(codgrp))
                for codgrp,aa in codon_table)


# editing addition:

def outputResults(filename,arn,codon_table,displ):

    li = ['This file is named %s' % filename]

    li.append('The sequence of ARN:\n%s' %\
              '\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
    li.append('Size of the sequence : '+str(len(arn)))

    li.append('Codon_table:\n'+\
              '\n'.join('%s : %s' % (u,v) for u,v in codon_table))

    li.append('Frequency results :\n'+displ)

    with open(filename,'w') as f:
        f.writelines('\n\n'.join(li))


outputResults('ARN_mem.txt',arn,codon_table,display)
print display 

.

编辑

我添加了一个函数outputResults()来显示在文件中记录数据和结果的方式

生成的文件的内容是:

This file is named ARN_mem.txt

The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA

Size of the sequence : 360

Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP

Frequency results :
      Alanine  GCU  0.1875          
               GCC  0.375           
               GCA  0.25            
               GCG  0.1875          
      Leucine  UUA  0.125           
               UUG  0.0             
               CUU  0.25            
               CUC  0.375   
etc.............
  • a codon table containing ALL the 64 codons, even the non-degenarated ones (they constitute one element groups)

  • counting the occurences of each codon's group at the same time that occurences of codons are counted during the iteration

  • codon table comprising the names of coded amino acids -> a good display

code:

from collections import defaultdict

# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

arn = adn.upper().replace('T','U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'),  'Alanine'),
               (('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),  'Leucine'),
               (('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),  'Arginine'),
               (('AAA', 'AAG'),  'Lysine'),
               (('AAU', 'AAC'),  'Asparagine'),
               (('AUG',),  'Methionine'),
               (('GAU', 'GAC'),  'Aspartic acid' ),              
               (('UUU', 'UUC'),  'Phenylalanine'),
               (('UGU', 'UGC'),  'Cysteine'),
               (('CCU', 'CCC', 'CCA', 'CCG'),  'Proline') ,
               (('CAA', 'CAG'),  'Glutamine'),
               (('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),  'Serine'),
               (('GAA', 'GAG'),  'Glutamic acid'),
               (('ACU', 'ACC', 'ACA', 'ACG'),  'Threonine'),
               (('GGU', 'GGC', 'GGA', 'GGG'),  'Glycine'),
               (('UGG',),  'Tryptophane'),
               (('CAU', 'CAC'),  'Histidine'),
               (('UAU', 'UAC'),  'Tyrosine'),
               (('AUU', 'AUC', 'AUA'),  'Isoleucine'),
               (('GUU', 'GUC', 'GUA', 'GUG'),  'Valine'),
               (('UAA', 'UGA', 'UAG'),  'STOP')            )

siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )

cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}

for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
    cod_count[cod] += 1
    grp_count[siblings[cod]] += 1

for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
    if siblings[cod] in grp_count:
        freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
    else:
        freq[cod] = '-* Missing *-'


display = '\n'.join(aa.rjust(13)+\
                '\n'.join('%s  %-16s' % (cod.rjust(18 if i else 5),freq[cod])
                          for i,cod in enumerate(codgrp))
                for codgrp,aa in codon_table)


# editing addition:

def outputResults(filename,arn,codon_table,displ):

    li = ['This file is named %s' % filename]

    li.append('The sequence of ARN:\n%s' %\
              '\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
    li.append('Size of the sequence : '+str(len(arn)))

    li.append('Codon_table:\n'+\
              '\n'.join('%s : %s' % (u,v) for u,v in codon_table))

    li.append('Frequency results :\n'+displ)

    with open(filename,'w') as f:
        f.writelines('\n\n'.join(li))


outputResults('ARN_mem.txt',arn,codon_table,display)
print display 

.

EDIT

I've added a function outputResults() to show the manner to record data and results in a file

The resulting file's content is:

This file is named ARN_mem.txt

The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA

Size of the sequence : 360

Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP

Frequency results :
      Alanine  GCU  0.1875          
               GCC  0.375           
               GCA  0.25            
               GCG  0.1875          
      Leucine  UUA  0.125           
               UUG  0.0             
               CUU  0.25            
               CUC  0.375   
etc.............
如果没有你 2024-11-10 07:22:26

我不确定我是否完全理解了这个问题,但我认为你需要将计算分为两个阶段:首先计算每个密码子出现的次数,然后计算出频率。我提出了以下代码:

from collections import defaultdict

# Initial sequence.
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"

# Which codons are grouped together.
groups = (
    ('AAA', 'AAC'),
    ('BOB',),
    ('FOO', 'BAR', 'BAA'),
)

# Separate into list of codons.
codonList = []
for codons in range(0, len(sequence), 3):
    codonList.append(sequence[codons:codons+3])

# Count how many times each codon is used.
counts = defaultdict(int)
for codon in codonList:
    counts[codon] += 1

# Go through and calculate frequencies of each codon.
freqs = {}
for group in groups:
    total = float(sum(counts[codon] for codon in group))
    for codon in group:
        freqs[codon] = counts[codon] / total

# Done.
print freqs

请注意在最后一个循环中将 total 显式转换为浮点数。如果保留为整数,则在 Python 2.x 上后续除法将是 0 或 1,因此我们需要将其转换以获得浮点输出。我得到的输出是:

blair@blair-eeepc:~$ python codons.py 
{'BAR': 0.5, 'AAC': 0.33333333333333331, 'BAA': 0.0, 'AAA': 0.66666666666666663, 'BOB': 1.0, 'FOO': 0.5}

这是您正在寻找的输出类型吗?

I'm not sure if I've fully understood the question, but I think you need to split the calculations into two stages: first count how many times each codon occurs, and then work out the frequencies. I've come up with the following code:

from collections import defaultdict

# Initial sequence.
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"

# Which codons are grouped together.
groups = (
    ('AAA', 'AAC'),
    ('BOB',),
    ('FOO', 'BAR', 'BAA'),
)

# Separate into list of codons.
codonList = []
for codons in range(0, len(sequence), 3):
    codonList.append(sequence[codons:codons+3])

# Count how many times each codon is used.
counts = defaultdict(int)
for codon in codonList:
    counts[codon] += 1

# Go through and calculate frequencies of each codon.
freqs = {}
for group in groups:
    total = float(sum(counts[codon] for codon in group))
    for codon in group:
        freqs[codon] = counts[codon] / total

# Done.
print freqs

Note the explicit conversion of total to a floating-point number in the last loop. If it is left as an integer, the subsequent division will be either 0 or 1 on Python 2.x, so we need to convert it to get a floating-point output. The output I get is:

blair@blair-eeepc:~$ python codons.py 
{'BAR': 0.5, 'AAC': 0.33333333333333331, 'BAA': 0.0, 'AAA': 0.66666666666666663, 'BOB': 1.0, 'FOO': 0.5}

Is this the sort of output you were looking for?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文