是否有一个函数可以计算给定比对参数的比对序列的分数？

发布于 2024-11-01 22:01:42 字数 405 浏览 5 评论 0原文

我尝试对已经对齐的序列进行评分。假设

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

给定的参数

substitution matrix : blosum62
gap open penalty : -5
gap extension penalty : -1

我确实浏览了biopython Cookbook，但我所能得到的只是替换矩阵blogsum62，但我觉得它一定有人已经实现了这种库。

那么有人可以建议任何可以解决我的问题的库或最短的代码吗？

提前谢谢

原文

I try to score the already-aligned sequences.
Let say

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

with given parameters

substitution matrix : blosum62
gap open penalty : -5
gap extension penalty : -1

I did look through the biopython cookbook but all i can get is substitution matrix blogsum62 but I feel that it must have someone already implemented this kind of library.

So can anyone suggest any libraries or shortest code that can solve my problem?

Thx in advance

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鹤舞 2024-11-08 22:01:42

Jessada，

Blosum62 矩阵（注意拼写；）位于 Bio.SubsMat.MatrixInfo 中，是一个包含元组解析为分数的字典（因此 ('A', 'A') 值 4 分）。它没有间隙，而且它只是矩阵的一个三角形（所以它可能有 ('T', 'A') 但不是 ('A', 'T')。Biopython 中有一些辅助函数，包括 Bio.Pairwise 中的一些，但这就是我想出的答案：

from Bio.SubsMat import MatrixInfo

def score_match(pair, matrix):
    if pair not in matrix:
        return matrix[(tuple(reversed(pair)))]
    else:
        return matrix[pair]

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e):
    score = 0
    gap = False
    for i in range(len(seq1)):
        pair = (seq1[i], seq2[i])
        if not gap:
            if '-' in pair:
                gap = True
                score += gap_s
            else:
                score += score_match(pair, matrix)
        else:
            if '-' not in pair:
                gap = False
                score += score_match(pair, matrix)
            else:
                score += gap_e
    return score

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

blosum = MatrixInfo.blosum62

score_pairwise(seq1, seq2, blosum, -5, -1)

它为你的对齐返回 82 几乎肯定有更漂亮的方法来完成所有这些，但这应该是一个好的开始。

Jessada,

The Blosum62 matrix (note the spelling ;) is in Bio.SubsMat.MatrixInfo and is a dictionary with tuples resolving to scores (so ('A', 'A') is worth 4 pts). It doesn't have the gaps, and it's only one triangle of the matrix (so it might ahve ('T', 'A') but not ('A', 'T'). There are some helper functions in Biopython, including some in Bio.Pairwise, but this is what I came up with as an answer:

from Bio.SubsMat import MatrixInfo

def score_match(pair, matrix):
    if pair not in matrix:
        return matrix[(tuple(reversed(pair)))]
    else:
        return matrix[pair]

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e):
    score = 0
    gap = False
    for i in range(len(seq1)):
        pair = (seq1[i], seq2[i])
        if not gap:
            if '-' in pair:
                gap = True
                score += gap_s
            else:
                score += score_match(pair, matrix)
        else:
            if '-' not in pair:
                gap = False
                score += score_match(pair, matrix)
            else:
                score += gap_e
    return score

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

blosum = MatrixInfo.blosum62

score_pairwise(seq1, seq2, blosum, -5, -1)

Which returns 82 for your alignment. There's almost certianly prettier ways to do all of this, but that should be a good start.

回复收藏 0 原文

谜泪 2024-11-08 22:01:42

blosum62 是一个包含 276 个项目的字典。

我更喜欢完成缺少的项目，因为它代表只有 276 圈的迭代，而要分析的序列可能有超过 276 个元素。因此，如果您借助函数 Score_match() 找到每对的分数，则该函数将必须对序列的每个元素执行测试ifpair not in Matrix，即也就是说肯定远不止276次。

另一件需要花费大量时间的事情是：每个 score += some 创建一个新的整数并将名称 score 绑定到这个新对象。每个绑定都需要一定的时间，而生成器生成的整数流不存在这些时间，这些时间会立即添加到当前量。

from Bio.SubsMat.MatrixInfo import blosum62 as blosum
from itertools import izip

blosum.update(((b,a),val) for (a,b),val in blosum.items())

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
    for A,B in izip(seq1, seq2):
        diag = ('-'==A) or ('-'==B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
        gap = diag

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print sum(score_pairwise(seq1, seq2, blosum, -5, -1))

这个score_pairwise()是一个生成器函数，因为有yield而不是return。

编辑：
Python 3 的更新代码：

from Bio.SubsMat.MatrixInfo import blosum62 as blosum

blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
    for A,B in zip(seq1, seq2):
        diag = ('-'==A) or ('-'==B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
        gap = diag

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print(sum(score_pairwise(seq1, seq2, blosum, -5, -1)))

blosum62 is a dictonary of 276 items.

I prefered to complete with the lacking items, because it represents an iteration of only 276 turns, while the sequences to be analysed are likely to have more than 276 elements. Consequently, if you find the score of each pair with the help of the function score_match() , this function will have to perform the test if pair not in matrix for each of the elements of the sequences, that is to say certainly far more than 276 times.

Another thing that takes a lot of time: each score += something creates a new integer and binds the name score to this new object. Each binding takes an amount of time that doesn't exist with a stream of integers by a generator that are immediatly added to the current amount.

from Bio.SubsMat.MatrixInfo import blosum62 as blosum
from itertools import izip

blosum.update(((b,a),val) for (a,b),val in blosum.items())

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
    for A,B in izip(seq1, seq2):
        diag = ('-'==A) or ('-'==B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
        gap = diag

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print sum(score_pairwise(seq1, seq2, blosum, -5, -1))

This score_pairwise() is a generator function because there is yield instead of return.

Edit:
Updated code for Python 3:

from Bio.SubsMat.MatrixInfo import blosum62 as blosum

blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
    for A,B in zip(seq1, seq2):
        diag = ('-'==A) or ('-'==B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
        gap = diag

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print(sum(score_pairwise(seq1, seq2, blosum, -5, -1)))

回复收藏 0 原文

~没有更多了~