在可变数量的集合中找到没有重复字符的最长公共子序列?

发布于 2024-12-24 02:57:09 字数 569 浏览 0 评论 0原文

我正在尝试提出一种有效的算法,可以在 JavaScript 中解决 最长公共子序列问题< /a>.然而,我的问题和维基百科文章中描述的问题有两个主要区别。首先是我将拥有两组以上的角色。第二个是角色在一组中永远不会重复。这意味着每组的长度最多约为 50 个字符(即可打印的 ASCII 字符)。

例如,这些集合可能包含:

A = ZBANICOT
B = ACNTBZIO
C = ANICOTZB
D = ZIANCOTB

...并且它应该输出 ACOACTANOANT >,因为这些是所有 4 个集合中最长的子序列(据我通过手动检查这些集合可知)。

由于没有一个字母重复,我是否应该考虑一种比维基百科文章中描述的算法更有效的算法?如果没有,是否有任何地方描述如何将算法转换为具有 N 组而不是 2 组?

I'm trying to come up with an efficient algorithm that would work in JavaScript for the longest common subsequence problem. However, there are two main differences between my problem and the one described in the Wikipedia article. The first is that I will have more than two sets of characters. The second is that characters will never repeat in a set. This means that the length of each set will be at most about 50 characters (i.e. the printable ASCII chars).

For example, the sets may contain:

A = ZBANICOT
B = ACNTBZIO
C = ANICOTZB
D = ZIANCOTB

... and it should output ACO, ACT, ANO, and ANT, since these are the longest subsequences in all 4 sets (as far as I can tell from manually examining these sets).

Since none of the letters repeat, is there a more efficient algorithm I should consider rather than the one described in the Wikipedia article? If not, is there anywhere that describes how to convert the algorithm to having N sets instead of 2?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

痴梦一场 2024-12-31 02:57:09

不确定正常 LCS 算法的适应效率如何,因此这可能效率低下,但由于符合条件的字母不多,所以这并不算太糟糕。

制作后继矩阵。对于每个字符串s,对于每对字母(s[i],s[j])i j,递增矩阵[s[i]][s[j]]。如果matrix[a][b] = n = 字符串数,一对字母只能是最长公共子序列的一部分。从参与此类对的字母构建一个有向图,边从 ab iff matrix[a][b] = n。在该图中找到最长的路径。

Not sure how efficient an adaption of the normal LCS algorithm would be, so this may be inefficient, but since there aren't many eligible letters, it's not too horrible.

Make a successor matrix. For each string s, for each pair of letters (s[i],s[j]) with i < j, increment matrix[s[i]][s[j]]. A pair of letters can only be part of a longest common subsequence if matrix[a][b] = n = number of strings. Build a directed graph from the letters participating in such a pair, with an edge from a to b iff matrix[a][b] = n. Find a longest path in that graph.

青衫负雪 2024-12-31 02:57:09

您想要做的事情称为多序列比对。该算法“可以”重写为 n 维,但这会将空间复杂度增加到长度 ^ 维度

由于没有一个字母重复,我是否应该考虑一种比维基百科文章中描述的算法更有效的算法?

也许。首先创建一个映射 (a,b) 使得 a < b.例如,每个集合的前三个字符的映射如下

--- Mapping for character[0]
Z - B    A - C    A - N    Z - I
    A        N        I        A
    N        I        C        N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

--- Mapping for character[1]
B - A    C - N    N - I    I - A
    N        I        C        N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

--- Mapping for character[2]
A - N    N - I    N - C    I - N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

然后仅保留所有集合上公共的映射。下面是遵循此过程的示例,遍历 Set[A] (ZBANICOT),这将为我们提供一组所有集合共有的关系。

因此,对于字母 Z,您将观察到在 Set[C] 中,Z 后面的唯一字符是 B,因此您可以消除关于从 Z 不到 B 的每个映射。但长话短说,Z 的每个映射都被删除。请注意,仅删除来自 Z 的映射,而不删除到 Z 的映射。

另请注意,由于 Set[D] 以 B 结尾,因此我们可以安全地消除来自 B 的所有映射。通过遵循该过程,我们可以观察到 A 映射到 NCOT< /代码>。接下来是 N,它映射到 TO

I 映射到 O(老实说,“I/O”),而 C 映射到 OT(多可爱)。有趣的是 O 映射到什么都没有!多合适啊。

最后还有一个字符,T 也映射为空(因为 ZBANICOTT 结尾)。

最后,您将得到以下映射:

A - C    C - O    I - O    N - O
    N        T                 T
    O
    T

现在您所要做的就是从每个键开始(ACIN)并搜索没有任何重复节点(本例中为字符)的最长序列。

以下是所有常见的子序列(全部使用简单算法):

ACO
ACT
ANO
ANT
AO
AT
CO
CT
IO
NT
NO

What you are trying to do is called multiple sequence alignment. The algorithm 'could' be rewritten to be n-dimensional, but that would increase the space complexity to Length ^ Dimensions.

Since none of the letters repeat, is there a more efficient algorithm I should consider rather than the one described in the Wikipedia article?

Perhaps. First create a mapping (a,b) such that a < b. For example the mappings for the first three characters of each sets is as follows

--- Mapping for character[0]
Z - B    A - C    A - N    Z - I
    A        N        I        A
    N        I        C        N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

--- Mapping for character[1]
B - A    C - N    N - I    I - A
    N        I        C        N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

--- Mapping for character[2]
A - N    N - I    N - C    I - N
    I        B        O        C
    C        Z        T        O
    O        I        Z        T
    T        O        B        B

Then retain only mappings that are common on all sets. Here's an example of following this process, traversing Set[A] (ZBANICOT), which will give us a set of relations that are common to all sets.

So with the letter Z you will observe that in Set[C], the only character that follows Z is B, so you can eliminate just about every mapping from Z that is not to B. But to cut a long story short, every mapping for Z is removed. Note, only mappings from Z are removed, not mappings to Z.

Also note that since Set[D] ends with B, we can safely eliminate all mappings from B. By following the process one can observe that A maps to N, C, O and T. Next is N, which maps to T and O.

I maps to O (honestly, "I/O") while C maps to O and T (how cute). And funnily enough O maps to nothing!!! How befitting.

Finally there is one more character, and T maps to nothing too (since ZBANICOT ends with T).

In the end you are left with the following mappings:

A - C    C - O    I - O    N - O
    N        T                 T
    O
    T

Now all you have to do is start from each of the keys (A, C, I and N) and search for the longest sequence without any repeating nodes (characters in this case).

Here are all of the common subsequences (all using the simple algorithm):

ACO
ACT
ANO
ANT
AO
AT
CO
CT
IO
NT
NO
写给空气的情书 2024-12-31 02:57:09

如果我存储每组字符的字符映射,然后每次增加此计数会怎样......

对于第一组 (ZBANICOT),这将创建以下字符对(其中 28 个)

ZB, ZA, ZN, ZI, ZC, ZO, ZT
    BA, BN, BI, BC, BO, BT
        AN, AI, AC, AO, AT
            NI, NC, NO, NT
                IC, IO, IT
                    CO, CT
                        OT

:第二组 (ACNTBZIO),我将再次拥有 28 对:

AC, AN, AT, AB, AZ, AI, AO
    CN, CT, CB, CZ, CI, CO
        NT, NB, NZ, NI, NO
            TB, TZ, TI, TO
                BZ, BI, BO
                    ZI, ZO
                        IO

对于最后 2 组,我将:(

AN, AI, AC, AO, AT, AZ, AB
    NI, NC, NO, NT, NZ, NB
        IC, IO, IT, IZ, IB
            CO, CT, CZ, CB
                OT, OZ, OB
                    TZ, TB
                        ZB

ZI, ZA, ZN, ZC, ZO, ZT, ZB
    IA, IN, IC, IO, IT, IB
        AN, AC, AO, AT, AB
            NC, NO, NT, NB
                CO, CT, CB
                    OT, OB
                        TB

因此要计算出到目前为止的迭代计数,让 N 等于数量集合,M 是每个集合中的字符数,我们有 N*(M)((M-1)/2)4*28 =112 在本例中)

接下来,我将计算每对存在的次数。例如,

PairCounts["ZB"] == 3
PairCounts["ZA"] == 2
// ...
PairCounts["AO"] == 4
PairCounts["AT"] == 4
// ...
PairCounts["AN"] == 4
PairCounts["AC"] == 4
PairCounts["CT"] == 4
PairCounts["NO"] == 4
PairCounts["NT"] == 4

然后,由于我们有 4 对,我会丢弃所有计数不为 4 的对。然后我会尝试连接每对,以形成可能的最长字符串。

What if I stored character mapping for each set of characters and then incremented this count each time...

For the 1st set (ZBANICOT), this would create the following character pairs (28 of them):

ZB, ZA, ZN, ZI, ZC, ZO, ZT
    BA, BN, BI, BC, BO, BT
        AN, AI, AC, AO, AT
            NI, NC, NO, NT
                IC, IO, IT
                    CO, CT
                        OT

For the 2nd set (ACNTBZIO), I would have 28 pairs again:

AC, AN, AT, AB, AZ, AI, AO
    CN, CT, CB, CZ, CI, CO
        NT, NB, NZ, NI, NO
            TB, TZ, TI, TO
                BZ, BI, BO
                    ZI, ZO
                        IO

For the last 2 sets, I would have:

AN, AI, AC, AO, AT, AZ, AB
    NI, NC, NO, NT, NZ, NB
        IC, IO, IT, IZ, IB
            CO, CT, CZ, CB
                OT, OZ, OB
                    TZ, TB
                        ZB

ZI, ZA, ZN, ZC, ZO, ZT, ZB
    IA, IN, IC, IO, IT, IB
        AN, AC, AO, AT, AB
            NC, NO, NT, NB
                CO, CT, CB
                    OT, OB
                        TB

(So to figure out the iteration count so far, let N equal the number of sets, and M be the number of characters in each set, we have N*(M)((M-1)/2) or 4*28=112 in this example)

Next, I would count the number of times each pair exists. For example,

PairCounts["ZB"] == 3
PairCounts["ZA"] == 2
// ...
PairCounts["AO"] == 4
PairCounts["AT"] == 4
// ...
PairCounts["AN"] == 4
PairCounts["AC"] == 4
PairCounts["CT"] == 4
PairCounts["NO"] == 4
PairCounts["NT"] == 4

Then, since we have 4 pairs, I would discard all pairs which don't have a count of 4. And then I would try concatting each pair in order to form the longest string possible.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文