How to compute a multiple sequence alignment of text strings

Posted on 2024-11-03 15:05:53

I'm writing a program that has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements, and I can tolerate approximations (i.e. I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e. UTF-8 strings, potentially with newlines that should be treated as regular characters); they aren't DNA sequences or protein sequences.

I can find tons of tools and information for the usual cases in bioinformatics, with specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem or encode my strings as DNA, but there must be a better way. Do you know of any solutions?

Thanks!

Comments (4)

萌酱 2024-11-10 15:05:53
  • The easiest way to align multiple sequences is to do a number of pairwise alignments.

First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Now pick the remaining sequence that aligns best to one of the sequences already in the aligned set, and align it to the aligned set, based on that pairwise alignment. Repeat until all sequences are in.

When you align a sequence to the aligned set (based on a pairwise alignment) and that inserts a gap into the sequence that is already in the set, you insert gaps in the same place in all sequences in the aligned set.

Lafrasu has suggested using SequenceMatcher for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.

In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.
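
The procedure above can be sketched in Python roughly as follows. This is only an illustration under some assumptions, not a tuned implementation: the pairwise alignments come from difflib.SequenceMatcher (with autojunk=False), similarity is its ratio(), and the helper names pairwise, add_to_alignment and align are made up for this example. Gaps are stored as None internally so that any character, including "-" or a newline, can appear in the input; the "-" used when rendering is an arbitrary choice.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise(a, b):
    """Align a and b; return (i, j) index pairs, with None marking a gap."""
    pairs = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pair characters position by position inside each opcode block;
        # the shorter side of a block is padded with gaps.
        for k in range(max(i2 - i1, j2 - j1)):
            i = i1 + k if i1 + k < i2 else None
            j = j1 + k if j1 + k < j2 else None
            pairs.append((i, j))
    return pairs


def add_to_alignment(rows, anchor, seqs, new):
    """Add seqs[new] to rows, guided by its pairwise alignment to seqs[anchor]."""
    old = rows[anchor]
    merged = {k: [] for k in rows}
    new_row = []
    col = 0  # next column of the existing alignment to copy

    def copy_column(ch):
        nonlocal col
        for k in rows:
            merged[k].append(rows[k][col])
        new_row.append(ch)
        col += 1

    for i, j in pairwise(seqs[anchor], seqs[new]):
        if i is None:
            # The new sequence has an extra character here:
            # open a gap at this place in every aligned row.
            for k in rows:
                merged[k].append(None)
            new_row.append(seqs[new][j])
        else:
            # Skip columns where the anchor row already has a gap.
            while old[col] is None:
                copy_column(None)
            copy_column(None if j is None else seqs[new][j])
    while col < len(old):
        copy_column(None)
    merged[new] = new_row
    return merged


def align(seqs, gap="-"):
    """Greedy progressive alignment; returns gapped strings in input order."""
    if len(seqs) < 2:
        return list(seqs)
    # All pairwise similarity scores (the expensive part).
    score = {
        (i, j): SequenceMatcher(None, seqs[i], seqs[j], autojunk=False).ratio()
        for i, j in combinations(range(len(seqs)), 2)
    }
    first, second = max(score, key=score.get)
    rows = add_to_alignment({first: list(seqs[first])}, first, seqs, second)
    remaining = set(range(len(seqs))) - {first, second}
    while remaining:
        # Best (unaligned, aligned) pair decides what to add and where.
        new, anchor = max(
            ((n, a) for n in remaining for a in rows),
            key=lambda p: score[min(p), max(p)],
        )
        rows = add_to_alignment(rows, anchor, seqs, new)
        remaining.remove(new)
    return ["".join(gap if c is None else c for c in rows[i])
            for i in range(len(seqs))]


if __name__ == "__main__":
    for row in align(["dsa jld lal", "dsajld kll", "dsc jle kal", "dsd jlekal"]):
        print(repr(row))
```

All returned rows have the same length, and removing the gap character from a row gives back the original string, provided the gap character does not itself occur in the input. If it does, pass a different gap argument.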

南渊 2024-11-10 15:05:53

Are you looking for something quick and dirty, as in the following?

from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"

ss = [a, b, c, d]

s = SequenceMatcher()

# Compare every unordered pair of strings once.
for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i + 1, len(ss)):
        y = ss[j]
        s.set_seq2(y)

        print()
        print(s.ratio())                # similarity score between 0 and 1
        print(s.get_matching_blocks())  # Match(a, b, size) runs common to x and y

梦言归人 2024-11-10 15:05:53

MAFFT version 7.120 and later supports multiple alignment of text. The input is in a FASTA-like format with LATIN1 text in place of the sequences, and the output is in aligned FASTA format. Once installed, it is easy to run:

mafft --text input_text.fa > output_alignment.fa

Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is still under development; future plans include support for user-defined scoring matrices. See the documentation for further details.
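
If the alignment is driven from Python, the command above could be wrapped roughly like this. It is a sketch under several assumptions, so check it against the MAFFT documentation before relying on it: the record names text0, text1, ... and the helpers to_fasta_like, parse_fasta and mafft_text_align are invented for this example, each text is assumed to fit on a single line (a raw newline would split a record, so newlines would need to be escaped first), and the parser assumes the aligned output may be wrapped over several lines. Which characters text mode accepts is not checked here.

```python
import os
import shutil
import subprocess
import tempfile


def to_fasta_like(texts):
    """Build FASTA-like input: a '>name' header line, then the text itself."""
    records = []
    for n, text in enumerate(texts):
        if "\n" in text or "\r" in text or text.startswith(">"):
            raise ValueError("escape newlines and a leading '>' before writing")
        records.append(f">text{n}\n{text}\n")
    return "".join(records)


def parse_fasta(data):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in data.split("\n"):
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def mafft_text_align(texts, exe="mafft"):
    """Run `mafft --text` on the texts and return the aligned records."""
    if shutil.which(exe) is None:
        raise RuntimeError(f"{exe} not found on PATH")
    # Encoding as Latin-1 raises UnicodeEncodeError for characters outside it.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".fa", encoding="latin-1", delete=False
    ) as fh:
        fh.write(to_fasta_like(texts))
    try:
        done = subprocess.run(
            [exe, "--text", fh.name], capture_output=True, check=True
        )
    finally:
        os.remove(fh.name)
    return parse_fasta(done.stdout.decode("latin-1"))
```

Because the input is written as Latin-1, arbitrary UTF-8 strings would first have to be mapped onto that range, which this sketch leaves out.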

删除→记忆 2024-11-10 15:05:53

I recently wrote a Python script that runs the Smith-Waterman algorithm (which is used to generate gapped local sequence alignments of DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (that's not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.
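
For reference, a bare-bones Smith-Waterman over arbitrary characters might look like the sketch below. This is not the script mentioned above, just a minimal illustration: the scores (match=2, mismatch=-1, and a linear gap penalty of -1) are arbitrary assumptions, and the function name smith_waterman is made up here. It runs in O(len(a) * len(b)) time and memory.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1, gap_char="-"):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("xxhelloyy", "zzhellozz"))
```

Characters are only compared for equality, so newlines and non-ASCII characters are handled like any others. Note that this gives a local alignment of two strings; for a multiple alignment it would only serve as the pairwise step.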
