用于在两个很长的文本序列中查找唯一集合的快速算法

发布于 2024-09-16 08:34:18 字数 244 浏览 13 评论 0原文

我需要比较 X 和 Y 染色体的 DNA 序列，并找到 Y 染色体特有的模式（由大约 50-75 个碱基对组成）。请注意，这些序列部分可以在染色体中重复。这需要快速完成（BLAST 需要 47 天，需要几个小时或更短时间）。是否有任何算法或程序特别适合这种比较？同样，速度是这里的关键。

我将其放在 SO 上的原因之一是从特定应用程序领域之外的人那里获得观点，他们可以提出他们在日常使用中在字符串比较中使用的算法，这些算法可能适用于我们的使用。所以不要害羞！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

眼波传意 2024-09-23 08:34:18

在序列 X 上构建后缀树 S。
对于序列 Y 中的每个起始位置 i，查找S 中的字符串 Y[i..i+75]。如果从位置 i 开始找不到匹配项（即，如果在 j < 75 个核苷酸匹配后查找失败），那么您就找到了 Y 所特有的长度为 j 的字符串。
所有起始位置 i 中最小的此类字符串是最短的唯一字符串（或者如果您对最小化长度不感兴趣，则在找到任何此类字符串后停止）。

总时间：O(|X| + m|Y|)，其中 m 是最大字符串长度（例如 m = 75）。

可能存在基于广义后缀树的更有效的算法。

回复收藏 0 原文

烟燃烟灭 2024-09-23 08:34:18

我假设您有一个 X 和一个 Y 来比较。将它们连接起来，并用两者中都没有出现的标记字符分隔，以形成例如 XoY。现在在线性时间内形成http://en.wikipedia.org/wiki/Suffix_array。

您得到的是指向连接字符串中位置的指针数组，其中指针的排列方式使得它们指向的子字符串按字母顺序出现在数组中。您还可以获得一个 LCP 数组，给出后缀和数组中紧邻其之前的后缀之间共享的最长公共前缀的长度，即排序比它短的后缀。事实上，这是该位置和小于它的任何子字符串之间共享的最长公共前缀，因为具有更长公共前缀且小于 string[i] 的任何内容都会在它和当前 string[i - 1] 之间排序。

您可以从指针的位置看出指针指向哪个原始字符串，因为 X 在 Y 之前。您可以将数组切割成 X 和 Y 指针交替的子部分。 pos[i]和pos[i - 1]之间共享的公共前缀的长度是lcp[i]。 pos[i]和pos[i-2]之间共享的前缀长度为min(lcp[i],lcp[i-1])。因此，如果您从 X 范围之前的 Y 值开始，您可以通过逐步向下该部分并在每一步执行最小操作来依次计算出该 Y 和所有 X 之间的前缀字符数。类似地，您可以计算出所有这些 X 和后缀数组中下一个出现的 Y 之间共享的前缀字符数，每个 X 花费一分钟。当然，对于 Y 范围也是如此。现在计算每个条目的最大值，计算出 X（或 Y）中的每个位置与 Y（或 X）中的任何位置之间共享的最长前缀。

我认为您希望 X 或 Y 中的子字符串在它与其他性别的任何其他子字符串之间共享小前缀，因为从同一位置开始比此长一个字符的字符串根本不会出现在其他性别中。我认为一旦完成上面的 min() 计算，您就可以使用堆提取 N 个最小的前缀子字符串来跟踪 N 个最小的条目。我认为这里的一切都需要 |X| 线性时间+ |Y| （除非 N 与 |X| 或 |Y| 相当）。

I am assuming that you have a single X and a single Y to compare. Concatenate them, separated by a marker character that does not appear in either, to form e.g. XoY. Now form the http://en.wikipedia.org/wiki/Suffix_array in linear time.

What you get is an array of pointers to positions in the concatenated string, where the pointers are arranged so that the substrings they point to appear in alphabetical order in the array. You also get an LCP array, giving the length of the longest common prefix shared between a suffix and the suffix directly before it in the array, which is the suffix that sorts just less than it. This is in fact the longest common prefix shared between that position and ANY substring less than it, because anything with a longer common prefix and less than string[i] would sort between it and the current string[i - 1].

You can tell which original string a pointer points into from its position, because X comes before Y. You can cut the array up into alternating sub-sections of X and Y pointers. The length of the common prefix shared between pos[i] and pos[i - 1] is lcp[i]. The length of the prefix shared between pos[i] and pos[i-2] is min(lcp[i], lcp[i-1]). So if you start at the Y value just before a range of Xs you can work out the number of characters of prefix between that Y and all of the Xs in turn by stepping down the section, doing a min operation at each step. Similarly you can work out the number of characters of prefix shared between all of those Xs and the Y that appears next in the suffix array at the cost of one min per X. Ditto, of course for the Y ranges. Now do a max per entry to work out the longest prefix shared between each position in X (or Y) and any position in Y (or X).

I think you want the substrings within either X or Y which have small prefixes shared between it and any other substring of the other sex, because the strings one character longer than this starting from the same position do not appear in the other sex at all. I think once you have done the min() calculations above you can extract the N smallest prefix substrings using a heap to keep track of the N smallest entries. I think everything here takes time linear in |X| + |Y| (unless N is comparable to |X| or |Y|).

回复收藏 0 原文

阪姬 2024-09-23 08:34:18

此论文可能有一些替代方案来适应BLAST以提高其性能（通过细分问题空间AFAIKS）。

回复收藏 0 原文

兲鉂ぱ嘚淚 2024-09-23 08:34:18

我有一个有趣的答案，这将是技术性的。主要思想是子序列的比较应该在GPU上进行，因为现代显卡的GPU是高度并行的处理环境（就像小型超级计算机）。
因此，我们可以将碱基对编码为一个像素，假设 X 染色体有 1.54 亿对，我们得到一张由 1.54 亿像素组成的 X 染色体图像，该图像大小约为 500 MB。对于 Y 染色体，我们得到一张大小为 160 MB 的图像。
所以这两个（500+160）MB图像可以通过下降视频卡非常有效地处理。（只需购买具有 >= 1 GB 显存 RAM 的显卡即可）。

下一步是编写 GPU 程序，可能需要借助 Pixel Shader 或 Cuda 或 OpenCL

GPU 程序很简单 - 基本上它将 Y 染色体图像中的 50-75 个相邻像素与 X 染色体图像的所有像素进行比较。所以这个 GPU 程序最多应该有 75*1.54 亿次操作，这将在现代 GPU 上在一小时左右的时间内完成。因为Y的所有子序列都会被并行测试。

希望有帮助

回复收藏 0 原文

~没有更多了~