最长公共子序列 (LCS) 长度的快速(呃)算法
问题:需要两个字符串之间的 LCS 长度。字符串的大小最多为 100 个字符。字母表是常见的 DNA 字母表,4 个字符“ACGT”。动态方法不够快。
我的问题是,我正在处理很多很多对(据我所知,有数亿对)。我相信我已经将 LCS_length 函数的调用减少到尽可能少,因此使我的程序运行得更快的唯一其他方法是拥有更高效的 LCS_Length 函数。
我首先采用通常的动态编程方法来实现。 这给出了正确的答案,并希望得到正确的实施。
#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101
static int MaxLength(int lengthA, int lengthB);
/*
* Then the two strings are compared following a dynamic computing
* LCS table algorithm. Since we only require the length of the LCS
* we can get this rather easily.
*/
int LCS_Length(char *a, char *b)
{
int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b),
LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];
maxLength = MaxLength(lengthA, lengthB);
//printf("%d %d\n", lengthA, lengthB);
for (i = 0; i < maxLength - 1; i++)
{
board[i][0] = 0;
board[0][i] = 0;
}
for (i = 1; i < lengthA; i++)
{
for (j = 1; j < lengthB; j++)
{
/* If a match is found we allocate the number in (i-1, j-1) incremented
* by 1 to the (i, j) position
*/
if (a[i - 1] == b[j - 1])
{
board[i][j] = board[i-1][j-1] + 1;
if(LCS < board[i][j])
{
LCS++;
}
}
else
{
if (board[i-1][j] > board[i][j-1])
{
board[i][j] = board[i-1][j];
}
else
{
board[i][j] = board[i][j-1];
}
}
}
}
return LCS;
}
那应该是 O(mn) (希望如此)。
然后在寻找速度时,我发现了这篇文章 最长公共子序列 这给出了 O(ND )迈尔斯的论文。我尝试过将 LCS 与最短编辑脚本 (SES) 联系起来。 他们给出的关系是 D = M + N - 2L,其中 D 是 SES 的长度,M 和 N 是两个字符串的长度,L 是 LCS 长度。但这在我的实现中并不总是正确的。我给出我的实现(我认为是正确的):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define arrayLengthMacro(a) strlen(a)
int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);
int main(void)
{
int L;
char a[] = "tomato", b[] = "potato"; //must give LCS = 4
L = LCS_Length(a, b);
printf("LCS: %d\n", L );
char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
L = LCS_Length(c, d);
printf("LCS: %d\n", L );
char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
L = LCS_Length(e, f);
printf("LCS: %d\n", L );
char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
L = LCS_Length(g, h);
printf("LCS: %d\n", L );
char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC",
j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
L = LCS_Length(i, j);
printf("LCS: %d\n", L );
return 0;
}
int LCS_Length(char *a, char *b)
{
int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
max, *V_, *V, i, k, x, y;
max = lengthA + lengthB;
V_ = malloc(sizeof(int) * (max+1));
if(V_ == NULL)
{
fprintf(stderr, "Failed to allocate memory for LCS");
exit(1);
}
V = V_ + lengthA;
V[1] = 0;
for (D = 0; D < max; D++)
{
for (k = -D; k <= D; k = k + 2)
{
if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
{
x = V[k+1];
}
else
{
x = V[k-1] + 1;
}
y = x - k;
while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
{
x++;
y++;
}
V[k] = x;
if ((x >= lengthB) && (y >= lengthA))
{
return (lengthA + lengthB - D)/2;
}
}
}
return (lengthA + lengthB - D)/2;
}
main中有例子。 例如。 “tomato”和“potato”(不评论)的 LCS 长度为 4。 实现发现 SES(代码中的 D)是 5 倍,结果是 2,而不是我期望的 4(删除“t”,插入“p”,删除“m”,插入“t”)。我倾向于认为 O(ND) 算法可能也会计算替换,但我对此不确定。
任何可实施的方法(我没有丰富的编程技能)都是受欢迎的。 (例如,如果有人知道如何利用小字母表)。
编辑:我认为最重要的是,我可以随时在任何两个字符串之间使用 LCS 函数。因此,与其他几百万个字符串相比,这不仅仅是 s1 字符串。可能是 s200 和 s1000,然后是 s0 和 s10000,最后是 250 和 s100000。也不可能跟踪最常用的字符串。 我要求 LCS 长度不是近似值,因为我正在实现近似算法并且我不想添加额外的误差。
编辑:刚刚运行callgrind。对于 10000 个字符串的输入,对于不同的字符串对,我似乎调用 lcs 函数大约 50,000,000 次。 (10000 个字符串是我想要运行的最低值,最大值是 100 万个(如果可行的话))。
Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic approach is not quick enough.
My problem is that I am dealing with lot's and lot's of pairs (of the rank of hundreds of million as far as I can see). I believe I have decreased the calling of the LCS_length function to the minimum possible so the only other way to make my program run faster is to have a more efficient LCS_Length function.
I have started off by implementing in the usual dynamic programming approach.
That gives the correct answer and is hopefully implemented in properly.
#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101
static int MaxLength(int lengthA, int lengthB);
/*
* Then the two strings are compared following a dynamic computing
* LCS table algorithm. Since we only require the length of the LCS
* we can get this rather easily.
*/
int LCS_Length(char *a, char *b)
{
int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b),
LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];
maxLength = MaxLength(lengthA, lengthB);
//printf("%d %d\n", lengthA, lengthB);
for (i = 0; i < maxLength - 1; i++)
{
board[i][0] = 0;
board[0][i] = 0;
}
for (i = 1; i < lengthA; i++)
{
for (j = 1; j < lengthB; j++)
{
/* If a match is found we allocate the number in (i-1, j-1) incremented
* by 1 to the (i, j) position
*/
if (a[i - 1] == b[j - 1])
{
board[i][j] = board[i-1][j-1] + 1;
if(LCS < board[i][j])
{
LCS++;
}
}
else
{
if (board[i-1][j] > board[i][j-1])
{
board[i][j] = board[i-1][j];
}
else
{
board[i][j] = board[i][j-1];
}
}
}
}
return LCS;
}
That should be O(mn) (hopefully).
Then in search for speed I found this post Longest Common Subsequence
Which gave the O(ND) paper by Myers. I tried this which relates the LCS with the shortest edit script (SES).
The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings and L is the LCS length. But this isn't always correct in my implementation. I give my implementation (which I think is correct):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define arrayLengthMacro(a) strlen(a)
int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);
int main(void)
{
int L;
char a[] = "tomato", b[] = "potato"; //must give LCS = 4
L = LCS_Length(a, b);
printf("LCS: %d\n", L );
char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
L = LCS_Length(c, d);
printf("LCS: %d\n", L );
char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
L = LCS_Length(e, f);
printf("LCS: %d\n", L );
char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
L = LCS_Length(g, h);
printf("LCS: %d\n", L );
char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC",
j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
L = LCS_Length(i, j);
printf("LCS: %d\n", L );
return 0;
}
int LCS_Length(char *a, char *b)
{
int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b),
max, *V_, *V, i, k, x, y;
max = lengthA + lengthB;
V_ = malloc(sizeof(int) * (max+1));
if(V_ == NULL)
{
fprintf(stderr, "Failed to allocate memory for LCS");
exit(1);
}
V = V_ + lengthA;
V[1] = 0;
for (D = 0; D < max; D++)
{
for (k = -D; k <= D; k = k + 2)
{
if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
{
x = V[k+1];
}
else
{
x = V[k-1] + 1;
}
y = x - k;
while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
{
x++;
y++;
}
V[k] = x;
if ((x >= lengthB) && (y >= lengthA))
{
return (lengthA + lengthB - D)/2;
}
}
}
return (lengthA + lengthB - D)/2;
}
There are examples in the main.
Eg. "tomato" and "potato" (don't comment), have an LCS length of 4.
The implementation finds that it is 5 sice the SES (D in the code) comes out as 2 instead of 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.
Any approach that is implementable (I don't have immense programming skills), is welcome.
(If someone would know how to take advantage of the small alphabet for example).
EDIT: I think it would be useful to say on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only string say s1, compared with few million others. It might be s200 with s1000, then s0 with s10000, and then 250 with s100000. Not likely to be able to track most used strings either.
I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.
EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run and the max is 1 million (if that is feasible)).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我不熟悉比动态编程更奇特的算法
计算 LCS,但我想指出一些事情:
首先,O(ND) 方法只有在比较非常大、非常大的情况时才有意义。
类似的字符串。你的情况似乎并非如此。
其次,加速 LCD_Length 函数的渐近性能是
可能不是你应该关注的,因为你的琴弦很漂亮
短的。如果您只关心寻找相似或不相似的对(而不是所有
对的精确 LCS),那么 Yannick 提到的 BK 树看起来是一个有前途的
要走的路。
最后,关于你的 DP 实现,有些事情让我困扰。正确的
代码中“board[i][j]”的解释是“最长的子序列
字符串 a[1..i], b[1..j]" (我在此表示法中使用 1 索引)。因此,
你的 for 循环应该包括 i = lengthA 和 j = lengthB。看起来像你
通过引入 arrayLengthMacro(a) 来解决代码中的这个错误,但是
hack 在算法的上下文中没有意义。一旦“董事会”填满,
你可以在board[lengthA][lengthB]中查找解决方案,这意味着你可以得到
去掉不必要的“if (LCS < board[i][j])”块并返回
板[长度A][长度B]。另外,循环边界看起来是错误的
初始化(我很确定它应该是 (i = 0; i <= maxLength; i++)
其中 maxLength = max(lengthA, lengthB))。
I'm not familiar with the fancier-than-dynamic-programming algorithms for
computing LCS, but I wanted to point out a few things:
First, the O(ND) approach only makes sense if you're comparing very large, very
similar strings. This doesn't seem to be the case for you.
Second, speeding up the asymptotic performance of your LCD_Length function is
probably not what you should be focusing on since your strings are pretty
short. If you only care about finding similar or dissimilar pairs (and not all
pairs' exact LCS), then the BK-tree mentioned by Yannick looks like a promising
way to go.
Finally, some things bothered me about your DP implementation. The correct
interpretation of "board[i][j]" in your code is "the longest subsequence of
strings a[1..i], b[1..j]" (I'm using 1-indexing in this notation). Therefore,
your for loops should include i = lengthA and j = lengthB. It looks like you
hacked around this bug in your code by introducing arrayLengthMacro(a), but that
hack doesn't make sense in the context of the algorithm. Once "board" is filled,
you can look up the solution in board[lengthA][lengthB], which means you can get
rid of the unnecessary "if (LCS < board[i][j])" block and return
board[lengthA][lengthB]. Also, the loop bounds look wrong in the
initialization (I'm pretty sure it should be for (i = 0; i <= maxLength; i++)
where maxLength = max(lengthA, lengthB)).
我建议获取一份 Gusfield 的 字符串、树和序列的算法,全部都是关于计算生物学的字符串操作。
I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences which is all about string operations for computational biology.
有几种方法可以使计算速度更快:
编辑:对于与相同的蜇刺情况进行比较,一个人建议使用 BK-Tree 数据结构
计算相似度分数的有效方法当样本量很大时字符串的数量?
There several ways to make your computation faster:
EDIT: for the comparing-to-the-same-set-of-stings case, one person suggested a BK-Tree data structure in
Efficient way of calculating likeness scores of strings when sample size is large?