对于无序序列匹配问题,什么样的算法比较好?

发布于 2024-09-26 19:26:17 字数 859 浏览 1 评论 0 原文

如果我有两个序列(例如字符串),

//   01234567890123456789012  
a = "AAACDDFFFEE1122VV1VAADD"

//   0123456789012345678901
b = "DDFFAA11221DHHVV1VAAFE"

我想知道从 b 到 a 的最佳子字符串匹配(无序),例如:

optimal (6 matched parts, 19 characters of a matched)
b         a
DDFF   -> DDFF     (4,7)
AA     -> AA       (0,1)
1122   -> 1122     (11,14)
1     
D      -> D        (21)
HH
VV1VAA -> VV1VAA   (15,20)
FE     -> FE       (8,9)

还有另一种解决方案,但不是最优的:

not optimal (8 parts matched, 19 characters of a matched)
b        a
DDFF  -> DDFF  (4,7)
AA    -> AA    (0,1)
1122  -> 1122  (11,14)
1     -> 1     (17)
D     -> D     (21)
HH
VV    -> VV    (15,16)
1     
VAA   -> VAA   (18,20)
FE    -> FE    (8,9)

哪种算法更适合这个问题? ? (我需要最佳结果,性能至关重要)。

谢谢。

If I have two sequences (for example, string)

//   01234567890123456789012  
a = "AAACDDFFFEE1122VV1VAADD"

//   0123456789012345678901
b = "DDFFAA11221DHHVV1VAAFE"

I want to know the best substring matching (unordered) from b to a, for instance:

optimal (6 matched parts, 19 characters of a matched)
b         a
DDFF   -> DDFF     (4,7)
AA     -> AA       (0,1)
1122   -> 1122     (11,14)
1     
D      -> D        (21)
HH
VV1VAA -> VV1VAA   (15,20)
FE     -> FE       (8,9)

there is another solution, but not optimal:

not optimal (8 parts matched, 19 characters of a matched)
b        a
DDFF  -> DDFF  (4,7)
AA    -> AA    (0,1)
1122  -> 1122  (11,14)
1     -> 1     (17)
D     -> D     (21)
HH
VV    -> VV    (15,16)
1     
VAA   -> VAA   (18,20)
FE    -> FE    (8,9)

What kind of algorithm is better for this problem??? (I need optimal result and performance is critical).

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

指尖上的星空 2024-10-03 19:26:17

有趣的问题,你可以使用 Boyer-Moore 在 O(|a|.|b| + |b|^2) 中解决它( http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm ) 或 KMP (http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm)算法或任何其他线性时间字符串搜索算法。

  • 对于每个 b[0..i] ty 在字符串 a 中找到它(在 O(|a| + i) 中),直到你再也找不到它
  • 你知道你可以找到 b[0..i] 但不能b[0..i+1],因此您有 b[0..i] 的匹配项,然后继续 b[i+1..i+1],b[i+1..i+2] ..
  • 最后b的每个部分是否匹配,如果匹配则尽可能大。

总复杂度最多为 sum( O(|a| + i) , i=0..|b|) = O(|a|.|b| + |b|^2) 但它可以小得多,只要b 的小子串可以在 a 中找到。

编辑:

上述方法是贪婪的,不会最小化部分数量,但我认为它会最大化匹配的字符总数。


Thoughts on an optimal solution :

  • 对于 |b| 的每个 |b|^2 个子串 情况
  • 确定是否可以在 |a| 中找到它,并仅保留我们需要在这些字符串中查找其子集的 :
    • 其中任何两个之间没有重叠
    • 长度总和最大
    • 同等长度下,字符串数量必须最少

很容易计算,因为一个非常简单的解决方案是仅匹配大小为 1 的子字符串:然后是 length是 a 和 b 之间共同字母的数量。

因此,如果我们将 b 的大小为 1 的子字符串(甚至 a 中不存在的字母)添加到上面的匹配字符串集合中,我们需要找到 b 的最小集合覆盖且不重叠。

一般的集合覆盖是 NP 完全的,但这里有无重叠约束,有帮助吗?

我正在调查。

事实上,NP完全: http://www.springerlink.com/content/n73561q050w54pn6/

您可能想寻找近似算法......

Interesting problem, you could solve it in O(|a|.|b| + |b|^2) using Boyer-Moore ( http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm ) or KMP ( http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm ) algorithms or any other linear time string search algorithm.

  • For each b[0..i] ty to find it in the string a (in O(|a| + i) ) until you can't find it anymore
  • You know you can find b[0..i] but not b[0..i+1], so you have a match for b[0..i] and you continue with b[i+1..i+1],b[i+1..i+2]..
  • At the end every part of b has been matched or not, and if it has been matched if was as big as possible.

Total complexity is at most sum( O(|a| + i) , i=0..|b|) = O(|a|.|b| + |b|^2) but it can be much smaller if only small substrings of b can be found in a.

EDIT :

The above approach is greedy and will not minimize the number of parts, but I think it will maximize the total number of characters matched.


Thoughts on an optimal solution :

  • for every |b|^2 substring of |b| determine if it can be found in |a|, and keep only the ones for which it is the case
  • we need to find among these strings a subset of them with :
    • no overlap between any two of them
    • sum of length is maximum
    • at equal length, the number of strings must be minimum

The sum is easy to calculate because a very simple solution would be to match only substring of size 1 : then length is the number of common letters between a and b.

So if we add substring of size 1 of b (even the letters that are not in a) to the set of matching strings above we need to find a minimal set-cover of b with no overlapping.

The general set-cover is NP-complete, but here with have the no-overlapping constraints, does it helps ?

I'm looking into it.

Indeed, NP-complete : http://www.springerlink.com/content/n73561q050w54pn6/

You might want to look for approximation algorithms....

温柔戏命师 2024-10-03 19:26:17

如果我理解你的问题,你想要找到两个给定字符串的一组不重叠的公共子串,这些子串要么最大化公共子串的总长度,要么最小化公共子串的数量。我将提出以下启发式:找到两个字符串的最长公共子串(LCS),将其删除,重复。我无法证明这是最优的,但我有一个非常有效的算法

所以在你的例子中
AAACDDFFFEE1122VV1VAADD
DDFFAA11221DHHVV1VAAFE
濒海战斗舰 = VV1VAA

AAACDDFFFEE1122DD
DDFFAA11221DHHFE

LCS = DDFF

AAACFEE1122DD
AA11221DHHFE

LCS = 1122

AAACFEEDD
AADHHFE

算法如下
1)使用基于后缀树的标准LCS算法,即
1.1 构建两个字符串连接并具有唯一终止符的后缀树
1.2 用 1,2 或两者标记每个节点,具体取决于以该处为根的子树是否具有来自任一字符串或两个字符串的叶子
1.3 计算每个节点的字符串深度
1.4 找到字符串最深且标记为1和2的节点
2)删除以该节点为根的子树,并更新其上方节点的标签
3) 重复1.4,

当树中没有同时标记为1和2的节点时,算法终止
1.1 可以完成的时间与两个字符串的长度之和成正比
1.2、1.3 和 1.4 只不过是树遍历
2 如果树正确实现并且更新受到 LCS 长度的限制,则删除应该是恒定时间
3 又是一个树遍历,但是是一棵较小的树

所以这是一种优化,为了避免重复的树遍历,我们添加步骤 1.35:按字符串深度对具有两个标签的内部节点进行排序(在单独的数据结构中,树仍然存在) 。现在您可以扫描已排序的节点列表,执行 2) 并重复。通过这种优化,如果您可以使用基数排序,看起来该算法是线性时间的,并且您无法在渐近意义上击败它。

我希望这是正确且足够清晰的,我相信您需要先熟悉一下后缀树文献,然后才能听起来很明显。我推荐 Dan Gusfield 的书《字符串、树和序列的算法》,特别是第 7.4 节,如果您有疑问,请告诉我。

If I understand your problem, you want to find a set of non-overlapping common substrings of two given strings that either maximizes the total length of the common substrings and among those minimizes the number of common substrings. I will propose the following heuristic: find the longest common substring (LCS) of the two strings, remove it, repeat. I can't prove this is optimal, but I have a very efficient algorithm for it

So in your example
AAACDDFFFEE1122VV1VAADD
DDFFAA11221DHHVV1VAAFE
LCS = VV1VAA

AAACDDFFFEE1122DD
DDFFAA11221DHHFE

LCS = DDFF

AAACFEE1122DD
AA11221DHHFE

LCS = 1122

AAACFEEDD
AADHHFE

and so forth

The algorithm is as follows
1)Use the standard LCS algorithm based on suffix trees which is
1.1 build the suffix trees of the two strings concatenated and with unique terminators
1.2 mark every node with 1,2 or both depending whether the subtree rooted there has leaves from either or both strings
1.3 compute the string-depth of every node
1.4 find the string-deepest node that is labeled both 1 and 2
2) remove the subtree rooted at that node, and update the labels of nodes above it
3) repeat from 1.4

the algorithm terminates when the tree has no nodes that are labeled both 1 and 2
1.1 can be done in time proportional to the sum of the length of the two strings
1.2, 1.3 and 1.4 are little more than tree traversals
2 the removal should be constant time if the tree is implemented right and the update is bounded by the length of the LCS
3 is again a tree traversal but of a smaller tree

So this is one optimization, to avoid repeated tree traversals let's add step 1.35: sort internal nodes that have both labels by string depth (in a separate data structure, the tree is still there). Now you can scan that sorted list of nodes, perform 2) and repeat. With this optimization and if you can use radix sorting, it looks like the algorithm is linear time, and you can't beat that in an asymptotic sense.

I hope this is correct and clear enough, I am sure you will need to familiarize yourself with the suffix tree literature a little bit before it sounds obvious. I recommend Dan Gusfield's book Algorithms on String, Trees and Sequences, the section in particular is 7.4 Let me know if you have questions.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文