How to compute a multiple sequence alignment of text strings
I'm writing a program that has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements, and I can tolerate approximations (i.e. I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e. UTF-8 strings, potentially with newlines that should be treated as a regular character); they aren't DNA sequences or protein sequences.
I can find tons of tools and information for the usual cases in bioinformatics, with specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem, or encode my strings as DNA, but there must be a better way. Do you know of any solutions?
Thanks!
Answers (4)
First, compute a pairwise similarity score for every pair of sequences and store those scores. This is the most expensive part of the process. Choose the pair with the best similarity score and align it. Then pick the remaining sequence that aligned best to one of the sequences already in the aligned set, and align it to the aligned set based on that pairwise alignment. Repeat until all sequences are in.
Lafrasu has suggested difflib's SequenceMatcher for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.
In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.
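The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: it assumes `difflib.SequenceMatcher` both for the similarity score (`ratio()`) and for the pairwise alignment, and the names `similarity`, `add_row`, `progressive_align` and `render` are hypothetical. Gaps are stored as `None`, so they cannot collide with any real character, newlines included.

```python
from difflib import SequenceMatcher
from itertools import combinations

GAP = None  # gap marker; None can never clash with a character of the data


def similarity(a, b):
    # Assumption: difflib's ratio() stands in for the pairwise similarity score.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def add_row(rows, partner, seq):
    """Align seq against rows[partner] and append it to rows (in place)."""
    ops = SequenceMatcher(None, rows[partner], list(seq),
                          autojunk=False).get_opcodes()
    new_rows = [[] for _ in rows]
    new_seq = []
    for _tag, i1, i2, j1, j2 in ops:
        width = max(i2 - i1, j2 - j1)
        # Pad whichever side is shorter so every row stays the same length.
        for out, row in zip(new_rows, rows):
            out.extend(row[i1:i2] + [GAP] * (width - (i2 - i1)))
        new_seq.extend(list(seq[j1:j2]) + [GAP] * (width - (j2 - j1)))
    rows[:] = new_rows + [new_seq]


def progressive_align(seqs):
    """Return one gapped row (list of characters or GAP) per input string."""
    if len(seqs) < 2:
        return [list(s) for s in seqs]
    # Step 1: score every pair once and store the scores.
    score = {}
    for i, j in combinations(range(len(seqs)), 2):
        score[i, j] = score[j, i] = similarity(seqs[i], seqs[j])
    # Step 2: start from the best-scoring pair.
    first, second = max(combinations(range(len(seqs)), 2),
                        key=lambda pair: score[pair])
    order, rows = [first], [list(seqs[first])]
    add_row(rows, 0, seqs[second])
    order.append(second)
    remaining = set(range(len(seqs))) - {first, second}
    # Step 3: repeatedly add the sequence closest to any aligned sequence.
    while remaining:
        new, partner = max(((n, p) for n in remaining for p in order),
                           key=lambda pair: score[pair])
        add_row(rows, order.index(partner), seqs[new])
        order.append(new)
        remaining.remove(new)
    # Put the rows back into the order of the input.
    result = [None] * len(seqs)
    for idx, row in zip(order, rows):
        result[idx] = row
    return result


def render(row, gap_char="-"):
    # For display only; gap_char may clash with real data.
    return "".join(gap_char if c is GAP else c for c in row)
```

Calling `progressive_align(["abcdef", "abdef", "abcdf"])` returns three rows of equal length; stripping the `None` entries from a row gives back the corresponding input string.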
Are you looking for something quick and dirty, as in the following?
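A quick-and-dirty pairwise alignment in this spirit might look like the sketch below. It is not the answerer's original snippet: it assumes Python's `difflib.SequenceMatcher` (the approach another answer attributes to this one), and the function name `align_pair` and the default `-` gap character are illustrative choices.

```python
from difflib import SequenceMatcher


def align_pair(a, b, gap="-"):
    """Return gap-padded copies of a and b that have the same length.

    gap must be a single character; choose one that does not occur in
    the data if the gaps need to be told apart from real characters.
    """
    out_a, out_b = [], []
    # autojunk=False stops difflib from ignoring very frequent characters.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        # Pad the shorter segment so both outputs grow by the same amount.
        out_a.append(seg_a.ljust(width, gap))
        out_b.append(seg_b.ljust(width, gap))
    return "".join(out_a), "".join(out_b)


print(align_pair("abcdef", "abdef"))
```

Newlines are treated like any other character, since `SequenceMatcher` only compares elements for equality.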
MAFFT version 7.120+ supports multiple text alignment. The input is like FASTA format but with LATIN1 text instead of sequences, and the output is in aligned FASTA format. Once installed, it is easy to run.
Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is still in development, with future plans including support for user-defined scoring matrices. You can find further details in the documentation.
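If you want to drive MAFFT from Python, a wrapper could look roughly like this. It is a sketch under assumptions: that the `mafft` executable is on the PATH, that text mode is selected with the `--text` option (check the documentation of your installed version), and that the file name `input_text.fa` and the helpers `to_fasta`, `parse_fasta` and `run_mafft_text` are hypothetical names. Because the FASTA-like format is line-based, this sketch rejects strings containing newlines or `>`; such characters would need to be mapped to placeholders first.

```python
import subprocess


def to_fasta(texts):
    """Build FASTA-like input: a '>name' header line, then the text."""
    lines = []
    for i, text in enumerate(texts):
        if "\n" in text or ">" in text:
            raise ValueError("newlines and '>' must be replaced before export")
        lines.append(f">seq{i}")
        lines.append(text)
    return "\n".join(lines) + "\n"


def parse_fasta(data):
    """Parse aligned FASTA output into (name, aligned_text) pairs."""
    records = []
    for line in data.split("\n"):
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            # An aligned record may be wrapped over several lines.
            records[-1][1] += line
    return [(name, body) for name, body in records]


def run_mafft_text(texts, path="input_text.fa"):
    # Assumption: `mafft --text <file>` writes the alignment to stdout.
    with open(path, "w", encoding="latin-1") as handle:
        handle.write(to_fasta(texts))
    result = subprocess.run(["mafft", "--text", path],
                            capture_output=True, check=True)
    return parse_fasta(result.stdout.decode("latin-1"))
```

Only `run_mafft_text` needs MAFFT installed; the other two helpers are plain string handling.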
I recently wrote a Python script that runs the Smith-Waterman algorithm (which is used to generate gapped local sequence alignments of DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (that's not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.
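For reference, a character-agnostic Smith-Waterman can be sketched in a few lines. This is not the script mentioned above; it is a minimal illustration with a linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the `-` gap character are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1, gap_char="-"):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("xxhelloyy", "zzhellozz"))
```

Note that Smith-Waterman is a pairwise, local alignment; for a multiple alignment it would still have to be combined with a progressive scheme or a similar strategy.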