序列比对
我有以下关于序列比对的问题:
我们知道,当您想要强制两个序列在其整个长度上比对时,全局比对算法非常有用,而局部比对可以找到两个序列之间具有最高相似性的一个或多个区域,并从那里向外建立比对。
当我们有一个很长的序列和一个小序列库时,找到库中小序列串联的最佳算法是什么,可以最大限度地减少比对成本?
I have the following question about sequence alignment:
We know that global alignment algorithms are useful when you want to force two sequences to align over their entire length, and local alignment finds the region or regions of highest similarity between two sequences and build the alignment outward from there.
What is the best algorithm to find the concatenation of small sequences in a library that minimizes alignment cost when we have one sequence that is very long and a library of small sequences?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
令 Σ 为字母表(例如,{A, C, G, T})。设 L ⊆ Σ* 为短文库序列的集合。计算 L* 的最小状态 DFA (Q, Σ, ∂, q0, F)。
我们一次扫描一个字母长序列 x ∈ Σ*。令x'为已消耗的x的前缀。对于每个状态 q ∈ Q,我们维持 [每个序列 y ∈ Σ* 上的最小值 cq(x'),使得 ∂(q0, y) = x' 和 y 之间的编辑距离 q]。
对于空前缀 ε,对于每个状态 q ∈ Q,都满足 cq(ε) = min {|y|: y ∈ Σ*, ∂(q0, y) = q},因为 y 和 ε 之间的距离是 y 的长度。在转移图上使用广度优先搜索计算初始表。
给定 x' 的表格和字母 s,我们计算 cq(x) 作为 y 的几种可能性的最小值,其中 x = x' s。
字符串 y = y' sz,对齐 s。本例中的成本为 minq', z: ∂(q', sz) = q (cq'(x') + |z|),其中可以通过 |Q| 计算广度优先搜索。
字符串 y = y',删除 x 中的 s。本例中的成本为 cq(x') + 1。
字符串 y = y' t,其中 t 是一个字母,用 s 代替 t(反之亦然)。本例中的成本为 minq', t: ∂(q', t) = q (cq'(x') + 1)。
最后,最优对齐成本为 minq ∈ F cq(x)。可以按照动态程序的常用方式重建对齐方式。
Let ∑ be the alphabet (e.g., {A, C, G, T}). Let L ⊆ ∑* be the set of short library sequences. Compute a minimum-state DFA (Q, ∑, ∂, q0, F) for L*.
We scan the long sequence x ∈ ∑* one letter at a time. Let x' be the prefix of x that has been consumed. We maintain, for every state q ∈ Q, the minimum cq(x') over [every sequence y ∈ ∑* such that ∂(q0, y) = q] of the Levenshtein distance between x' and y.
For the empty prefix ε, for every state q ∈ Q, it holds that cq(ε) = min {|y|: y ∈ ∑*, ∂(q0, y) = q}, since the distance between y and ε is the length of y. Compute the initial table with breadth-first search on the transition graph.
Given the table for x' and a letter s, we compute cq(x) as the minimum over several possibilities for y, where x = x' s.
Strings y = y' s z, aligning the s's. The cost in this case is minq', z: ∂(q', s z) = q (cq'(x') + |z|), which can be computed by |Q| breadth-first searches.
Strings y = y', deleting the s in x. The cost in this case is cq(x') + 1.
Strings y = y' t where t is a letter, substituting s for t (or vice versa). The cost in this case is minq', t: ∂(q', t) = q (cq'(x') + 1).
At the end, the optimal alignment cost is minq ∈ F cq(x). The alignment can be reconstructed in the usual way for dynamic programs.
一种天真的方法是尝试每一种排列。如果
S
是库中每个小序列的排列集合,您可以将大序列与S
中的每个序列一一比对,看看哪一个具有最小的对齐成本。不幸的是,这对 CPU 来说并不友好,因为S
的大小将随着小序列的数量呈指数增长。One naive approach would be to try every permutation. If
S
is the set of permutations of each small sequence in the library, you could align the large sequence with every sequence inS
, one by one, and see which one has the minimum alignment cost. Unfortunately, this won't be CPU-friendly as the size ofS
would be exponential in the number of small sequences.