最长公共子序列 (LCS) 长度的快速(呃)算法

发布于 2024-11-18 01:33:06 字数 4888 浏览 2 评论 0原文

问题:需要两个字符串之间的 LCS 长度。字符串的大小最多为 100 个字符。字母表是常见的 DNA 字母表,4 个字符“ACGT”。动态方法不够快。

我的问题是,我正在处理很多很多对(据我所知,有数亿对)。我相信我已经将 LCS_length 函数的调用减少到尽可能少,因此使我的程序运行得更快的唯一其他方法是拥有更高效的 LCS_Length 函数。

我首先采用通常的动态编程方法来实现。 这给出了正确的答案,并希望得到正确的实施。

#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * Then the two strings are compared following a dynamic computing
 * LCS table algorithm. Since we only require the length of the LCS 
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b), 
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

        maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
/* If a match is found we allocate the number in (i-1, j-1) incremented  
 * by 1 to the (i, j) position
 */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

那应该是 O(mn) (希望如此)。

然后在寻找速度时,我发现了这篇文章 最长公共子序列 这给出了 O(ND )迈尔斯的论文。我尝试过将 LCS 与最短编辑脚本 (SES) 联系起来。 他们给出的关系是 D = M + N - 2L,其中 D 是 SES 的长度,M 和 N 是两个字符串的长度,L 是 LCS 长度。但这在我的实现中并不总是正确的。我给出我的实现(我认为是正确的):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

main中有例子。 例如。 “tomato”和“potato”(不评论)的 LCS 长度为 4。 实现发现 SES(代码中的 D)是 5 倍,结果是 2,而不是我期望的 4(删除“t”,插入“p”,删除“m”,插入“t”)。我倾向于认为 O(ND) 算法可能也会计算替换,但我对此不确定。

任何可实施的方法(我没有丰富的编程技能)都是受欢迎的。 (例如,如果有人知道如何利用小字母表)。

编辑:我认为最重要的是,我可以随时在任何两个字符串之间使用 LCS 函数。因此,与其他几百万个字符串相比,这不仅仅是 s1 字符串。可能是 s200 和 s1000,然后是 s0 和 s10000,最后是 250 和 s100000。也不可能跟踪最常用的字符串。 我要求 LCS 长度不是近似值,因为我正在实现近似算法并且我不想添加额外的误差。

编辑:刚刚运行callgrind。对于 10000 个字符串的输入,对于不同的字符串对,我似乎调用 lcs 函数大约 50,000,000 次。 (10000 个字符串是我想要运行的最低值,最大值是 100 万个(如果可行的话))。

Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic approach is not quick enough.

My problem is that I am dealing with lot's and lot's of pairs (of the rank of hundreds of million as far as I can see). I believe I have decreased the calling of the LCS_length function to the minimum possible so the only other way to make my program run faster is to have a more efficient LCS_Length function.

I have started off by implementing in the usual dynamic programming approach.
That gives the correct answer and is hopefully implemented in properly.

#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * Then the two strings are compared following a dynamic computing
 * LCS table algorithm. Since we only require the length of the LCS 
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b), 
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

        maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
/* If a match is found we allocate the number in (i-1, j-1) incremented  
 * by 1 to the (i, j) position
 */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

That should be O(mn) (hopefully).

Then in search for speed I found this post Longest Common Subsequence
Which gave the O(ND) paper by Myers. I tried this which relates the LCS with the shortest edit script (SES).
The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings and L is the LCS length. But this isn't always correct in my implementation. I give my implementation (which I think is correct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

There are examples in the main.
Eg. "tomato" and "potato" (don't comment), have an LCS length of 4.
The implementation finds that it is 5 sice the SES (D in the code) comes out as 2 instead of 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.

Any approach that is implementable (I don't have immense programming skills), is welcome.
(If someone would know how to take advantage of the small alphabet for example).

EDIT: I think it would be useful to say on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only string say s1, compared with few million others. It might be s200 with s1000, then s0 with s10000, and then 250 with s100000. Not likely to be able to track most used strings either.
I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.

EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run and the max is 1 million (if that is feasible)).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

梦中楼上月下 2024-11-25 01:33:06

我不熟悉比动态编程更奇特的算法
计算 LCS,但我想指出一些事情:

首先,O(ND) 方法只有在比较非常大、非常大的情况时才有意义。
类似的字符串。你的情况似乎并非如此。

其次,加速 LCD_Length 函数的渐近性能是
可能不是你应该关注的,因为你的琴弦很漂亮
短的。如果您只关心寻找相似或不相似的对(而不是所有
对的精确 LCS),那么 Yannick 提到的 BK 树看起来是一个有前途的
要走的路。

最后,关于你的 DP 实现,有些事情让我困扰。正确的
代码中“board[i][j]”的解释是“最长的子序列
字符串 a[1..i], b[1..j]" (我在此表示法中使用 1 索引)。因此,
你的 for 循环应该包括 i = lengthA 和 j = lengthB。看起来像你
通过引入 arrayLengthMacro(a) 来解决代码中的这个错误,但是
hack 在算法的上下文中没有意义。一旦“董事会”填满,
你可以在board[lengthA][lengthB]中查找解决方案,这意味着你可以得到
去掉不必要的“if (LCS < board[i][j])”块并返回
板[长度A][长度B]。另外,循环边界看起来是错误的
初始化(我很确定它应该是 (i = 0; i <= maxLength; i++)
其中 maxLength = max(lengthA, lengthB))。

I'm not familiar with the fancier-than-dynamic-programming algorithms for
computing LCS, but I wanted to point out a few things:

First, the O(ND) approach only makes sense if you're comparing very large, very
similar strings. This doesn't seem to be the case for you.

Second, speeding up the asymptotic performance of your LCD_Length function is
probably not what you should be focusing on since your strings are pretty
short. If you only care about finding similar or dissimilar pairs (and not all
pairs' exact LCS), then the BK-tree mentioned by Yannick looks like a promising
way to go.

Finally, some things bothered me about your DP implementation. The correct
interpretation of "board[i][j]" in your code is "the longest subsequence of
strings a[1..i], b[1..j]" (I'm using 1-indexing in this notation). Therefore,
your for loops should include i = lengthA and j = lengthB. It looks like you
hacked around this bug in your code by introducing arrayLengthMacro(a), but that
hack doesn't make sense in the context of the algorithm. Once "board" is filled,
you can look up the solution in board[lengthA][lengthB], which means you can get
rid of the unnecessary "if (LCS < board[i][j])" block and return
board[lengthA][lengthB]. Also, the loop bounds look wrong in the
initialization (I'm pretty sure it should be for (i = 0; i <= maxLength; i++)
where maxLength = max(lengthA, lengthB)).

戴着白色围巾的女孩 2024-11-25 01:33:06

我建议获取一份 Gusfield 的 字符串、树和序列的算法,全部都是关于计算生物学的字符串操作。

I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences which is all about string operations for computational biology.

一江春梦 2024-11-25 01:33:06

有几种方法可以使计算速度更快:

  • 您可以使用 A* 搜索(通过使用启发式算法,直到 (i,j) 的部分对齐必须有 |ij| 删除或插入),而不是简单的动态编程。
  • 如果您将一个序列与一大堆其他序列进行比较,您可以通过计算导致该前缀的部分的动态编程表(或 A* 搜索状态)并重新使用该计算部分来节省时间。如果您坚持使用动态编程表,您可以按字典顺序对字符串库进行排序,然后只丢弃更改的部分(例如,如果您有“香蕉”并想将其与“巴拿马”和“泛美主义”进行比较,您可以重复使用表格中包含“panam”的部分)。
  • 如果大多数字符串非常相似,您可以通过查找公共前缀并将公共前缀从计算中排除来节省时间(例如,“panama”和“panamericanism”的 LCS 是公共前缀“panam”加上“a”和“ericanism”)
  • 如果字符串非常不同,您可以使用符号计数来获得编辑次数的下限(例如,“AAAB”到“ABBB”至少需要 2 次编辑,因为其中一个有 3 个 A,另一个只有 1 个 A)。这也可以用在 A* 搜索的启发式中。

编辑:对于与相同的蜇刺情况进行比较,一个人建议使用 BK-Tree 数据结构
计算相似度分数的有效方法当样本量很大时字符串的数量?

There several ways to make your computation faster:

  • Instead of plain dynamic programming, you can use A* search (by using a heuristic that partial alignment up to (i,j) must necessarily have |i-j| deletions or insertions in it).
  • If you're comparing one sequence with a whole bunch of other ones, you can save time by computing the dynamic programming table (or the A* search state) for the part leading up to that prefix and re-use the part of your computation. If you stick with the dynamic programming table, you could sort the library of strings by lexicographic order and then only throw away the part that changes (e.g. if you have 'banana' and want to compare it with 'panama' and 'panamericanism', you can reuse the part of the table that covers 'panam').
  • If most of the strings are quite similar, you can save time by looking for a common prefix and excluding the common prefix from the computation (e.g. the LCS of "panama" and "panamericanism" is the common prefix "panam" plus the LCS of "a" and "ericanism")
  • if the strings are quite dissimilar, you can use the symbol counts to get a lower bound on the number of edits (e.g., "AAAB" to "ABBB" need at least 2 edits because there are 3 As in one and only 1 A in the other). This can also be used in the heuristic for an A* search.

EDIT: for the comparing-to-the-same-set-of-stings case, one person suggested a BK-Tree data structure in
Efficient way of calculating likeness scores of strings when sample size is large?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文