在可变数量的集合中找到没有重复字符的最长公共子序列?
我正在尝试提出一种有效的算法,可以在 JavaScript 中解决 最长公共子序列问题< /a>.然而,我的问题和维基百科文章中描述的问题有两个主要区别。首先是我将拥有两组以上的角色。第二个是角色在一组中永远不会重复。这意味着每组的长度最多约为 50 个字符(即可打印的 ASCII 字符)。
例如,这些集合可能包含:
A = ZBANICOT
B = ACNTBZIO
C = ANICOTZB
D = ZIANCOTB
...并且它应该输出 ACO
、ACT
、ANO
和 ANT
>,因为这些是所有 4 个集合中最长的子序列(据我通过手动检查这些集合可知)。
由于没有一个字母重复,我是否应该考虑一种比维基百科文章中描述的算法更有效的算法?如果没有,是否有任何地方描述如何将算法转换为具有 N 组而不是 2 组?
I'm trying to come up with an efficient algorithm that would work in JavaScript for the longest common subsequence problem. However, there are two main differences between my problem and the one described in the Wikipedia article. The first is that I will have more than two sets of characters. The second is that characters will never repeat in a set. This means that the length of each set will be at most about 50 characters (i.e. the printable ASCII chars).
For example, the sets may contain:
A = ZBANICOT
B = ACNTBZIO
C = ANICOTZB
D = ZIANCOTB
... and it should output ACO
, ACT
, ANO
, and ANT
, since these are the longest subsequences in all 4 sets (as far as I can tell from manually examining these sets).
Since none of the letters repeat, is there a more efficient algorithm I should consider rather than the one described in the Wikipedia article? If not, is there anywhere that describes how to convert the algorithm to having N sets instead of 2?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
不确定正常 LCS 算法的适应效率如何,因此这可能效率低下,但由于符合条件的字母不多,所以这并不算太糟糕。
制作后继矩阵。对于每个字符串
s
,对于每对字母(s[i],s[j])
且i
j
,递增矩阵[s[i]][s[j]]
。如果matrix[a][b] = n = 字符串数
,一对字母只能是最长公共子序列的一部分。从参与此类对的字母构建一个有向图,边从a
到b
iffmatrix[a][b] = n
。在该图中找到最长的路径。Not sure how efficient an adaption of the normal LCS algorithm would be, so this may be inefficient, but since there aren't many eligible letters, it's not too horrible.
Make a successor matrix. For each string
s
, for each pair of letters(s[i],s[j])
withi < j
, incrementmatrix[s[i]][s[j]]
. A pair of letters can only be part of a longest common subsequence ifmatrix[a][b] = n = number of strings
. Build a directed graph from the letters participating in such a pair, with an edge froma
tob
iffmatrix[a][b] = n
. Find a longest path in that graph.您想要做的事情称为多序列比对。该算法“可以”重写为 n 维,但这会将空间复杂度增加到
长度 ^ 维度
。也许。首先创建一个映射 (a,b) 使得 a < b.例如,每个集合的前三个字符的映射如下
然后仅保留所有集合上公共的映射。下面是遵循此过程的示例,遍历 Set[A] (
ZBANICOT
),这将为我们提供一组所有集合共有的关系。因此,对于字母
Z
,您将观察到在 Set[C] 中,Z
后面的唯一字符是B
,因此您可以消除关于从Z
不到 B 的每个映射。但长话短说,Z
的每个映射都被删除。请注意,仅删除来自 Z 的映射,而不删除到Z
的映射。另请注意,由于 Set[D] 以
B
结尾,因此我们可以安全地消除来自B
的所有映射。通过遵循该过程,我们可以观察到A
映射到N
、C
、O
和T< /代码>。接下来是 N,它映射到
T
和O
。I
映射到O
(老实说,“I/O”),而C
映射到O
和T
(多可爱)。有趣的是O
映射到什么都没有!多合适啊。最后还有一个字符,
T
也映射为空(因为ZBANICOT
以T
结尾)。最后,您将得到以下映射:
现在您所要做的就是从每个键开始(
A
、C
、I
和N
)并搜索没有任何重复节点(本例中为字符)的最长序列。以下是所有常见的子序列(全部使用简单算法):
What you are trying to do is called multiple sequence alignment. The algorithm 'could' be rewritten to be n-dimensional, but that would increase the space complexity to
Length ^ Dimensions
.Perhaps. First create a mapping (a,b) such that a < b. For example the mappings for the first three characters of each sets is as follows
Then retain only mappings that are common on all sets. Here's an example of following this process, traversing Set[A] (
ZBANICOT
), which will give us a set of relations that are common to all sets.So with the letter
Z
you will observe that in Set[C], the only character that followsZ
isB
, so you can eliminate just about every mapping fromZ
that is not to B. But to cut a long story short, every mapping forZ
is removed. Note, only mappings from Z are removed, not mappings toZ
.Also note that since Set[D] ends with
B
, we can safely eliminate all mappings fromB
. By following the process one can observe thatA
maps toN
,C
,O
andT
. Next is N, which maps toT
andO
.I
maps toO
(honestly, "I/O") whileC
maps toO
andT
(how cute). And funnily enoughO
maps to nothing!!! How befitting.Finally there is one more character, and
T
maps to nothing too (sinceZBANICOT
ends withT
).In the end you are left with the following mappings:
Now all you have to do is start from each of the keys (
A
,C
,I
andN
) and search for the longest sequence without any repeating nodes (characters in this case).Here are all of the common subsequences (all using the simple algorithm):
如果我存储每组字符的字符映射,然后每次增加此计数会怎样......
对于第一组 (
ZBANICOT
),这将创建以下字符对(其中 28 个):第二组 (
ACNTBZIO
),我将再次拥有 28 对:对于最后 2 组,我将:(
因此要计算出到目前为止的迭代计数,让
N
等于数量集合,M
是每个集合中的字符数,我们有N*(M)((M-1)/2)
或4*28 =112 在本例中)
接下来,我将计算每对存在的次数。例如,
然后,由于我们有 4 对,我会丢弃所有计数不为 4 的对。然后我会尝试连接每对,以形成可能的最长字符串。
What if I stored character mapping for each set of characters and then incremented this count each time...
For the 1st set (
ZBANICOT
), this would create the following character pairs (28 of them):For the 2nd set (
ACNTBZIO
), I would have 28 pairs again:For the last 2 sets, I would have:
(So to figure out the iteration count so far, let
N
equal the number of sets, andM
be the number of characters in each set, we haveN*(M)((M-1)/2)
or4*28=112
in this example)Next, I would count the number of times each pair exists. For example,
Then, since we have 4 pairs, I would discard all pairs which don't have a count of 4. And then I would try concatting each pair in order to form the longest string possible.