长字符串序列的成对比对

发布于 2024-11-07 04:42:02 字数 428 浏览 7 评论 0原文

我想找到两个长(数万)字符串序列之间的全局最优(或接近最优)成对对齐,但该算法预计适用于任何对象序列。 我还想使用自己的距离函数实现来计算两个对象的相似度。对于较短的序列,我可以使用动态时间扭曲 (DTW) 算法,但 DTW 算法需要计算和存储 *m 距离矩阵(n,m 是序列的长度),这对于较长的序列是不可行的。你能推荐这样的算法吗?有效的实施将是一个优势。

下面的例子阐明了算法需要做什么:

Input:
Sequence A: i saw the top of the mountain
Sequence B: then i see top of the mountains

Result:    
Sequence A:      i saw the top of the mountain
Sequence B: then i see     top of the mountains

I want to find the globally optimal (or close to optimal) pairwise alignment between two long (tens of thousands) sequences of strings, but the algorithm is expected to operate on any object sequences.
I also want to use my own distance function implementation to compute the similarity of two objects. For shorter sequences, I could use the dynamic time warping (DTW) algorithm but the DTW algorithm needs to compute and store a n*m distance matrix (n,m are lengths of the sequences) which is not feasible for longer sequences. Can you recommend such algorithm? A working implementation would be a plus.

The following example clarifies what the algorithm needs to do:

Input:
Sequence A: i saw the top of the mountain
Sequence B: then i see top of the mountains

Result:    
Sequence A:      i saw the top of the mountain
Sequence B: then i see     top of the mountains

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

路弥 2024-11-14 04:42:02

我不知道,我是否正确理解了您的要求,但在我看来,您正在尝试解决 稳定婚姻问题。如果我没记错的话,链接中 Gale 和 Shapley 的原始解决方案是在 O(n*m) 时间和 O(n+m) 空间中。实现非常简单。
还有一些后来的解决方案具有不同的问题变体。

您还可以尝试使用最大二部图匹配来解决此问题,例如 这里,用于获得不同的最优标准。这也可以在 O(n*m) 内完成。

I don't know, if I understood your requirements right, but it sounds to me like you are trying to solve a Stable Marriage Problem. The original solution of Gale and Shapley in the link is in O(n*m) time and O(n+m) space, if my memory serves me right. The implementation was pretty straightforward.
There are also some later solutions with different variants of the problem.

You also could try to solve this problem by using Maximum Bipartite Graph Matching, e.g. here, for getting a different optimality criterion. This also can be done in O(n*m).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文