字符串平铺算法

发布于 2024-08-05 06:12:10 字数 1928 浏览 10 评论 0原文

我正在寻找一种有效的算法来进行字符串平铺。基本上,您会得到一个字符串列表,例如 BCDCDEABCA,生成的平铺字符串应为ABCDE ,因为 BCDCDE 对齐产生 BCDE,然后与 ABC 对齐产生最终结果ABCDE

目前,我正在使用一种稍微幼稚的算法,其工作原理如下。从一对随机字符串开始,例如 BCDCDE,我使用以下内容(在 Java 中):

public static String tile(String first, String second) {
  for (int i = 0; i < first.length() || i < second.length(); i++) {
    // "right" tile (e.g., "BCD" and "CDE")
    String firstTile = first.substring(i);
    // "left" tile (e.g., "CDE" and "BCD")  
    String secondTile = second.substring(i);
    if (second.contains(firstTile)) {
      return first.substring(0, i) + second;
    } else if (first.contains(secondTile)) {
      return second.substring(0, i) + first;
    }
  }
  return EMPTY;
}

System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ

虽然这有效,但效率不是很高,因为它会迭代同样的角色一遍又一遍地出现。

那么,有人知道更好(更有效)的算法来做到这一点吗?这个问题类似于 DNA 序列比对问题,因此非常欢迎来自该领域的人(当然还有其他人)的任何建议。另请注意,我不是在寻找对齐,而是平铺,因为我需要其中一个字符串与另一个字符串完全重叠。

我目前正在寻找 Rabin-Karp 算法的改编版本,以便提高算法的渐近复杂度,但在进一步研究这个问题之前我想听听一些建议。

提前致谢。


对于存在歧义的情况——例如,{ABC, CBA}可能导致ABCBACBABC——任何平铺都可以回来了。不过这种情况很少出现,因为我是平铺的话,例如 {This is, is me} =>; {This is me},它们被操纵以便上述算法起作用。

类似的问题:重叠字符串连接的高效算法

I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.

Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):

public static String tile(String first, String second) {
  for (int i = 0; i < first.length() || i < second.length(); i++) {
    // "right" tile (e.g., "BCD" and "CDE")
    String firstTile = first.substring(i);
    // "left" tile (e.g., "CDE" and "BCD")  
    String secondTile = second.substring(i);
    if (second.contains(firstTile)) {
      return first.substring(0, i) + second;
    } else if (first.contains(secondTile)) {
      return second.substring(0, i) + first;
    }
  }
  return EMPTY;
}

System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ

Although this works, it's not very efficient, as it iterates over the same characters over and over again.

So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.

I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.

Thanks in advance.


For situations where there is ambiguity -- e.g., {ABC, CBA} which could result in ABCBA or CBABC --, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.

Similar question: Efficient Algorithm for String Concatenation with Overlap

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

层林尽染 2024-08-12 06:12:11

按第一个字符,然后是长度(从小到大)对字符串进行排序,然后将适应应用于 这个问题关于连接重叠字符串。

Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.

一个人练习一个人 2024-08-12 06:12:11

我认为这应该适用于两个字符串的平铺,并且比当前使用 substring 和 contains 的实现更有效。从概念上讲,我循环遍历“左”字符串中的字符,并将它们与“右”字符串中的字符进行比较。如果两个字符匹配,我将移动到右侧字符串中的下一个字符。根据首先到达末尾的字符串,以及最后比较的字符是否匹配,将识别一种可能的平铺情况。

我没有想到任何可以提高平铺两个以上字符串的时间复杂度的方法。作为多个字符串的一个小注释,下面的算法可以轻松扩展为一次检查单个“左”字符串与多个“右”字符串的平铺,如果您尝试这样做,这可能会阻止对字符串进行额外的循环。通过尝试所有可能性来找出是否执行(“ABC”,“BCX”,“XYZ”)或(“ABC”,“XYZ”,BCX”)。一点。

string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.

    string ab = TileLeftToRight(a, b);

    if (ab != "")
        return ab;

    return TileLeftToRight(b, a);

    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC" "BCABC").
}

string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;

    while (true)
    {
        if (left[i] != right[j])
        {
            i++;

            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;

            if (i >= left.Length)
                return left + right.Substring(j);

            if (j >= right.Length)
                return left;
        }
    }
}

I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string the end is first reached of, and if the last compared characters match or not, one of the possible tiling cases is identified.

I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings, this algorithm below is easily extended to checking the tiling of a single 'left' string with multiple 'right' strings at once, which might prevent extra looping over the strings a bit if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", BCX") by just trying all the possibilities. A bit.

string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.

    string ab = TileLeftToRight(a, b);

    if (ab != "")
        return ab;

    return TileLeftToRight(b, a);

    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC" "BCABC").
}

string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;

    while (true)
    {
        if (left[i] != right[j])
        {
            i++;

            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;

            if (i >= left.Length)
                return left + right.Substring(j);

            if (j >= right.Length)
                return left;
        }
    }
}
埖埖迣鎅 2024-08-12 06:12:11

如果开源代码可以接受,那么您应该检查斯坦福大学 STAMP基因组基准> 基准套件:它几乎完全符合您的要求。从一堆字符串(“基因”)开始,它寻找包含所有基因的最短字符串。例如,如果您有 ATGC 和 GCAA,它会找到 ATGCAA。该算法没有将其限制为 4 个字符的字母表,因此这应该能够对您有所帮助。

If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.

暖心男生 2024-08-12 06:12:11

首先要问的是要不要求{CDB,CDA}的耕作?没有单一的耕作。

The first thing to ask is if you want to find the tilling of {CDB, CDA}? There is no single tilling.

堇色安年 2024-08-12 06:12:11

有趣的问题。你需要某种回溯。例如,如果您有:

ABC, BCD, DBC 

将 DBC 与 BCD 组合会产生:

ABC, DBCD

哪个是不可解的。但是将 ABC 与 BCD 组合会得到:

ABCD, DBC

可以组合为:

ABCDBC.

Interesting problem. You need some kind of backtracking. For example if you have:

ABC, BCD, DBC 

Combining DBC with BCD results in:

ABC, DBCD

Which is not solvable. But combining ABC with BCD results in:

ABCD, DBC

Which can be combined to:

ABCDBC.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文