最长公共子序列 (LCS) 长度的快速（呃）算法

发布于 2024-11-18 01:33:06 字数 4888 浏览 2 评论 0原文

问题：需要两个字符串之间的 LCS 长度。字符串的大小最多为 100 个字符。字母表是常见的 DNA 字母表，4 个字符“ACGT”。动态方法不够快。

我的问题是，我正在处理很多很多对（据我所知，有数亿对）。我相信我已经将 LCS_length 函数的调用减少到尽可能少，因此使我的程序运行得更快的唯一其他方法是拥有更高效的 LCS_Length 函数。

我首先采用通常的动态编程方法来实现。这给出了正确的答案，并希望得到正确的实施。

#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * Then the two strings are compared following a dynamic computing
 * LCS table algorithm. Since we only require the length of the LCS 
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b), 
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

        maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
/* If a match is found we allocate the number in (i-1, j-1) incremented  
 * by 1 to the (i, j) position
 */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

那应该是 O(mn) （希望如此）。

然后在寻找速度时，我发现了这篇文章最长公共子序列这给出了 O(ND ）迈尔斯的论文。我尝试过将 LCS 与最短编辑脚本 (SES) 联系起来。他们给出的关系是 D = M + N - 2L，其中 D 是 SES 的长度，M 和 N 是两个字符串的长度，L 是 LCS 长度。但这在我的实现中并不总是正确的。我给出我的实现（我认为是正确的）：

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

main中有例子。例如。 “tomato”和“potato”（不评论）的 LCS 长度为 4。实现发现 SES（代码中的 D）是 5 倍，结果是 2，而不是我期望的 4（删除“t”，插入“p”，删除“m”，插入“t”）。我倾向于认为 O(ND) 算法可能也会计算替换，但我对此不确定。

任何可实施的方法（我没有丰富的编程技能）都是受欢迎的。（例如，如果有人知道如何利用小字母表）。

编辑：我认为最重要的是，我可以随时在任何两个字符串之间使用 LCS 函数。因此，与其他几百万个字符串相比，这不仅仅是 s1 字符串。可能是 s200 和 s1000，然后是 s0 和 s10000，最后是 250 和 s100000。也不可能跟踪最常用的字符串。我要求 LCS 长度不是近似值，因为我正在实现近似算法并且我不想添加额外的误差。

编辑：刚刚运行callgrind。对于 10000 个字符串的输入，对于不同的字符串对，我似乎调用 lcs 函数大约 50,000,000 次。（10000 个字符串是我想要运行的最低值，最大值是 100 万个（如果可行的话））。

原文

Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic approach is not quick enough.

My problem is that I am dealing with lot's and lot's of pairs (of the rank of hundreds of million as far as I can see). I believe I have decreased the calling of the LCS_length function to the minimum possible so the only other way to make my program run faster is to have a more efficient LCS_Length function.

I have started off by implementing in the usual dynamic programming approach.
That gives the correct answer and is hopefully implemented in properly.

#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * Then the two strings are compared following a dynamic computing
 * LCS table algorithm. Since we only require the length of the LCS 
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b), 
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

        maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
/* If a match is found we allocate the number in (i-1, j-1) incremented  
 * by 1 to the (i, j) position
 */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

That should be O(mn) (hopefully).

Then in search for speed I found this post Longest Common Subsequence
Which gave the O(ND) paper by Myers. I tried this which relates the LCS with the shortest edit script (SES).
The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings and L is the LCS length. But this isn't always correct in my implementation. I give my implementation (which I think is correct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

There are examples in the main.
Eg. "tomato" and "potato" (don't comment), have an LCS length of 4.
The implementation finds that it is 5 sice the SES (D in the code) comes out as 2 instead of 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.

Any approach that is implementable (I don't have immense programming skills), is welcome.
(If someone would know how to take advantage of the small alphabet for example).

EDIT: I think it would be useful to say on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only string say s1, compared with few million others. It might be s200 with s1000, then s0 with s10000, and then 250 with s100000. Not likely to be able to track most used strings either.
I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.

EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run and the max is 1 million (if that is feasible)).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦中楼上月下 2024-11-25 01:33:06

我不熟悉比动态编程更奇特的算法
计算 LCS，但我想指出一些事情：

首先，O(ND) 方法只有在比较非常大、非常大的情况时才有意义。
类似的字符串。你的情况似乎并非如此。

其次，加速 LCD_Length 函数的渐近性能是
可能不是你应该关注的，因为你的琴弦很漂亮
短的。如果您只关心寻找相似或不相似的对（而不是所有
对的精确 LCS），那么 Yannick 提到的 BK 树看起来是一个有前途的
要走的路。

最后，关于你的 DP 实现，有些事情让我困扰。正确的
代码中“board[i][j]”的解释是“最长的子序列
字符串 a[1..i], b[1..j]" （我在此表示法中使用 1 索引）。因此，
你的 for 循环应该包括 i = lengthA 和 j = lengthB。看起来像你
通过引入 arrayLengthMacro(a) 来解决代码中的这个错误，但是
hack 在算法的上下文中没有意义。一旦“董事会”填满，
你可以在board[lengthA][lengthB]中查找解决方案，这意味着你可以得到
去掉不必要的“if (LCS < board[i][j])”块并返回
板[长度A][长度B]。另外，循环边界看起来是错误的
初始化（我很确定它应该是 (i = 0; i <= maxLength; i++)
其中 maxLength = max(lengthA, lengthB))。

回复收藏 0 原文

戴着白色围巾的女孩 2024-11-25 01:33:06

我建议获取一份 Gusfield 的字符串、树和序列的算法，全部都是关于计算生物学的字符串操作。

回复收藏 0 原文

一江春梦 2024-11-25 01:33:06

有几种方法可以使计算速度更快：

您可以使用 A* 搜索（通过使用启发式算法，直到 (i,j) 的部分对齐必须有 |ij| 删除或插入），而不是简单的动态编程。
如果您将一个序列与一大堆其他序列进行比较，您可以通过计算导致该前缀的部分的动态编程表（或 A* 搜索状态）并重新使用该计算部分来节省时间。如果您坚持使用动态编程表，您可以按字典顺序对字符串库进行排序，然后只丢弃更改的部分（例如，如果您有“香蕉”并想将其与“巴拿马”和“泛美主义”进行比较，您可以重复使用表格中包含“panam”的部分）。
如果大多数字符串非常相似，您可以通过查找公共前缀并将公共前缀从计算中排除来节省时间（例如，“panama”和“panamericanism”的 LCS 是公共前缀“panam”加上“a”和“ericanism”）
如果字符串非常不同，您可以使用符号计数来获得编辑次数的下限（例如，“AAAB”到“ABBB”至少需要 2 次编辑，因为其中一个有 3 个 A，另一个只有 1 个 A）。这也可以用在 A* 搜索的启发式中。

编辑：对于与相同的蜇刺情况进行比较，一个人建议使用 BK-Tree 数据结构
计算相似度分数的有效方法当样本量很大时字符串的数量？

回复收藏 0 原文

~没有更多了~