如何计算最长公共子序列的数量

发布于 2024-08-21 17:51:01 字数 387 浏览 0 评论 0原文

我正在尝试计算两个字符串之间存在的最长可能子序列的数量

例如 字符串 X =“efgefg”; 字符串 Y =“efegf”;

输出:最长公共序列的数量为:3 (即:efeg、efef、efgf - 这不需要通过算法计算,仅在此处显示以进行演示)

我已经设法使用基于 O(|X|*|Y|) 的动态规划在 O(|X|*|Y|) 中做到这一点总体思路如下:最便宜路径算法

任何人都可以想出一种方法来以更好的运行时间有效地进行这种计算吗?

——针对杰森的评论进行编辑。

I'm trying to calculate the amount of longest possible subsequences that exist between two strings.

e.g.
String X = "efgefg";
String Y = "efegf";

output: The Number of longest common sequences is: 3
(i.e.: efeg, efef, efgf - this doesn't need to be calculated by the algorithm, just shown here for demonstration)

I've managed to do this in O(|X|*|Y|) using dynamic programming based on the general idea here: Cheapest path algorithm.

Can anyone think of a way to do this calculation with better runtime efficiently?

--Edited in response to Jason's comment.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

无边思念无边月 2024-08-28 17:51:01

我不知道,但这里有一些大声思考的尝试:

我能够构造的最坏情况有一个指数 - 2**(0.5 |X|) - 最长公共子序列的数量:

X = "aAbBcCdD..."
Y = "AaBbCcDd..."

其中最长公共子序列恰好包含一个{A, a} 中的一个,{B, b} 中的一个等等......(挑剔:如果你的字母表限制为 256 个字符,这最终会崩溃 - 但 2**128 已经很大了。)

但是,您不必生成所有子序列来对它们进行计数。
如果你有 O(|X| * |Y|),那么你已经比这更好了!我们从中学到的是,任何比您的算法更好的算法都不得尝试生成实际的子序列

I don't know but here are some attempts at thinking aloud:

The worst case I was able to construct has an exponential - 2**(0.5 |X|) - number of longest common subsequences:

X = "aAbBcCdD..."
Y = "AaBbCcDd..."

where the longest common subsequences include exactly one of {A, a}, exactly one of {B, b} and so forth... (nitpicking: if you alphabet is limited to 256 chars, this breaks down eventually - but 2**128 is already huge.)

However, you don't necessarily have to generate all subsequences to count them.
If you've got O(|X| * |Y|), you are already better than that! What we learn from this is that any algorithm better than yours must not attempt to generate the actual subsequences.

泛泛之交 2024-08-28 17:51:01

最长公共子序列问题是一个经过充分研究的 CS 问题。

您可能想在这里阅读:http://en.wikipedia.org/wiki/Longest_common_subsequence_problem< /a>

Longest common subsequence problem is a well studied CS problem.

You may want to read up on it here: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

山色无中 2024-08-28 17:51:01

首先,我们确实知道,除非强指数时间假设失败,否则无法在 O(n2-ε) 时间内找到长度为 n 的两个序列的任何最长公共子序列,请参阅:
https://arxiv.org/abs/1412.0348

这几乎意味着您无法计算如何在 O(n2-ε) 时间内将公共子序列与输入序列对齐的方法。
另一方面,可以在 O(n2) 时间内计算出这种对齐方式的数量。也可以通过所谓的四俄罗斯人加速在 O(n2/log(n)) 时间内对它们进行计数。

现在真正的问题是你是否真的打算计算这个或者你想找到不同子序列的数量?恐怕后者是一个#P-完全计数问题。至少,我们确实知道,计算正则语法可以生成的给定长度的序列数量是#P-complete的:

S。 Kannan、Z.Sweedyk 和 SR Mahaney。计数
以及随机生成常规语言的字符串。
ACM-SIAM 离散算法研讨会
(SODA),第 551–557 页,1995 年

这是一个类似的问题,因为计算正则语法可以生成给定长度序列的方式的数量是一个简单的动态规划算法。然而,如果你不想区分产生相同序列的世代,那么问题就会从简单变得极其困难。我的自然猜想是,序列比对问题也应该是这种情况(最长公共子序列、编辑距离、最短公共超串等)。

因此,如果您想计算两个序列的不同子序列的数量,那么您当前的算法很可能是错误的,并且任何算法都无法在多项式时间内计算它,除非 P = NP (以及更多... )。

First of all, we do know that finding any longest common subsequence of two sequences with length n cannot be done in O(n2-ε) time unless the Strong Exponential Time Hypothesis fails, see:
https://arxiv.org/abs/1412.0348

This pretty much implies that you cannot count the number of ways how to align common subsequences to the input sequences in O(n2-ε) time.
On the other hand, it is possible to count the number of ways of such alignments in O(n2) time. It is also possible to count them in O(n2/log(n)) time with the so-called four-Russians speed-up.

Now the real question if you really intended to calculate this or you want to find the number of different subsequences? I am afraid that this latter is a #P-complete counting problem. At least, we do know that counting the number of sequences with a given length that a regular grammar can generate is #P-complete:

S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting
and random generation of strings in regular languages.
In ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 551–557, 1995

This is a similar problem in that sense that counting the number of ways a regular grammar can generate sequences of a given length is a trivial dynamic programming algorithm. However, if you do not want to distinguish generations resulting the same sequence, then the problem turns from easy to extremely hard. My natural conjecture is that this should be the case for sequence alignment problems, too (longest common subsequence, edit distance, shortest common superstring, etc.).

So if you would like to calculate the number of different subsequences of two sequences, then very likely your current algorithm is wrong and any algorithm cannot calculate it in polynomial time unless P = NP (and more...).

浮光之海 2024-08-28 17:51:01

我发现的最佳解释(带代码):

计算所有 LCS

Best Explanation(with Code) I found :

Count all LCS

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文