长字符串序列的成对比对
我想找到两个长(数万)字符串序列之间的全局最优(或接近最优)成对对齐,但该算法预计适用于任何对象序列。 我还想使用自己的距离函数实现来计算两个对象的相似度。对于较短的序列,我可以使用动态时间扭曲 (DTW) 算法,但 DTW 算法需要计算和存储 *m 距离矩阵(n,m 是序列的长度),这对于较长的序列是不可行的。你能推荐这样的算法吗?有效的实施将是一个优势。
下面的例子阐明了算法需要做什么:
Input: Sequence A: i saw the top of the mountain Sequence B: then i see top of the mountains Result: Sequence A: i saw the top of the mountain Sequence B: then i see top of the mountains
I want to find the globally optimal (or close to optimal) pairwise alignment between two long (tens of thousands) sequences of strings, but the algorithm is expected to operate on any object sequences.
I also want to use my own distance function implementation to compute the similarity of two objects. For shorter sequences, I could use the dynamic time warping (DTW) algorithm but the DTW algorithm needs to compute and store a n*m distance matrix (n,m are lengths of the sequences) which is not feasible for longer sequences. Can you recommend such algorithm? A working implementation would be a plus.
The following example clarifies what the algorithm needs to do:
Input: Sequence A: i saw the top of the mountain Sequence B: then i see top of the mountains Result: Sequence A: i saw the top of the mountain Sequence B: then i see top of the mountains
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我不知道,我是否正确理解了您的要求,但在我看来,您正在尝试解决 稳定婚姻问题。如果我没记错的话,链接中 Gale 和 Shapley 的原始解决方案是在 O(n*m) 时间和 O(n+m) 空间中。实现非常简单。
还有一些后来的解决方案具有不同的问题变体。
您还可以尝试使用最大二部图匹配来解决此问题,例如 这里,用于获得不同的最优标准。这也可以在 O(n*m) 内完成。
I don't know, if I understood your requirements right, but it sounds to me like you are trying to solve a Stable Marriage Problem. The original solution of Gale and Shapley in the link is in O(n*m) time and O(n+m) space, if my memory serves me right. The implementation was pretty straightforward.
There are also some later solutions with different variants of the problem.
You also could try to solve this problem by using Maximum Bipartite Graph Matching, e.g. here, for getting a different optimality criterion. This also can be done in O(n*m).