回文检测效率

发布于 2024-07-08 06:36:36 字数 696 浏览 7 评论 0原文

我对 Jon Limjap 的面试事故感到好奇，并开始寻找有效的方法来进行回文检测。我检查了回文高尔夫答案，在我看来，答案中只有两种算法，反转字符串和从尾部和头部检查。

def palindrome_short(s):
    length = len(s)
    for i in xrange(0,length/2):
        if s[i] != s[(length-1)-i]: return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]

我认为这两种方法都不能用于检测巨大 DNA 序列中的精确回文。我环顾四周，没有找到任何关于这种超有效方法的免费文章。

一个好的方法可能是以分而治之的方法并行化第一个版本，为每个线程或处理器分配一对字符数组 1..n 和 length-1-n..length-1。

什么是更好的方法？

你知道任何？

原文

I got curious by Jon Limjap's interview mishap and started to look for efficient ways to do palindrome detection. I checked the palindrome golf answers and it seems to me that in the answers are two algorithms only, reversing the string and checking from tail and head.

def palindrome_short(s):
    length = len(s)
    for i in xrange(0,length/2):
        if s[i] != s[(length-1)-i]: return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]

I think neither of these methods are used in the detection of exact palindromes in huge DNA sequences. I looked around a bit and didn't find any free article about what an ultra efficient way for this might be.

A good way might be parallelizing the first version in a divide-and-conquer approach, assigning a pair of char arrays 1..n and length-1-n..length-1 to each thread or processor.

What would be a better way?

Do you know any?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

泪之魂 2024-07-15 06:36:37

给定一个回文数，你必须在 O(N) 时间内完成，是的。正如您所说，通过拆分字符串，您可以提高多处理器的效率。

现在假设您想要进行精确的 DNA 匹配。这些字符串有数千个字符长，而且非常重复。这给了我们优化的机会。

假设您将一个 1000 个字符的长字符串分成 5 对，每对 100,100。代码将如下所示：

isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...

等等...第一次进行这些匹配时，您将必须处理它们。但是，您可以将完成的所有结果添加到哈希表中，将映射对映射到布尔值：

isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}

等等...但这会占用太多内存。对于 100,100 个对，哈希映射将具有 2*4^100 个元素。假设您只存储两个 32 位字符串哈希值作为密钥，您将需要大约 10^55 兆字节，这是荒谬的。

也许如果你使用较小的字符串，问题就可以解决。然后你将拥有一个巨大的哈希图，但至少回文（假设 10x10 对）需要 O(1)，因此检查 1000 个字符串是否是回文将需要 100 次查找而不是 500 次比较。但它仍然是 O(N)...

Given only one palindrome, you will have to do it in O(N), yes. You can get more efficiency with multi-processors by splitting the string as you said.

Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.

Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:

isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...

etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:

isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}

etc... this will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Say that you only store two 32-bit hashes of strings as the key, you will need something like 10^55 megabytes, which is ridiculous.

Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least palindrome for let's say 10x10 pairs will take O(1), so checking if a 1000 string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...

回复收藏 0 原文

沙与沫 2024-07-15 06:36:37

不是她的第二个功能的变体。我们不需要检查正常字符串和反向字符串的正确部分是否相等。

def palindrome_reverse(s):
  l = len(s) // 2
  return s[:l] == s[l::-1]

Another variant of your second function. We need no check equals of the right parts of normal and reverse strings.

def palindrome_reverse(s):
  l = len(s) // 2
  return s[:l] == s[l::-1]

回复收藏 0 原文

聚集的泪 2024-07-15 06:36:37

显然，您无法获得比 O(n) 更好的渐近效率，因为每个字符必须至少检查一次。不过，您可以获得更好的乘法常数。

对于单个线程，您可以使用汇编来获得加速。您还可以通过一次检查大于一个字节的块中的数据来做得更好，但由于对齐考虑，这可能会很棘手。如果您可以一次检查大至 16 字节的块，那么使用 SIMD 效果会更好。

如果你想并行化它，你可以将字符串分成 N 段，然后让处理器 i 比较段 [i*n/2, (i+1)*N/2) 与段[L-(i+1)*N/2, Li*N/2)。

回复收藏 0 原文

撩动你心 2024-07-15 06:36:37

没有，除非你进行模糊匹配。这就是他们在 DNA 中可能做的事情（我已经使用 smith-waterman 在 DNA 中进行了 EST 搜索，但这显然比匹配序列中的回文或反向互补要困难得多）。

回复收藏 0 原文

拧巴小姐 2024-07-15 06:36:37

它们的复杂度都是 O(N)，所以我认为这些解决方案都不存在任何特定的效率问题。也许我的创意不够，但我不明白如何在少于 N 个步骤的时间内比较 N 个元素，所以像 O(log N) 这样的东西恕我直言绝对是不可能的。

并行性可能会有所帮助，但它仍然不会改变算法的大哦等级，因为它相当于在更快的机器上运行它。

回复收藏 0 原文

爱冒险 2024-07-15 06:36:37

从中心进行比较总是更有效，因为您可以在错过时尽早退出，但它还允许您进行更快的最大回文搜索，无论您是在寻找最大半径还是所有不重叠的回文。

唯一真正的并行化是如果您有多个独立的字符串要处理。每次未命中都会浪费大量工作，而且未命中的次数总是多于命中的次数。

回复收藏 0 原文

过去的过去 2024-07-15 06:36:37

除了其他人所说的之外，我还为非常大的输入添加了一些预检查标准：

quick check whether tail-character matches 
                    head character 

if NOT, just early exit by returning Boolean-False

if (input-length < 4) { 

    # The quick check just now already confirmed it's palindrome 

    return Boolean-True  

} else if (200 < input-length) {
     
     # adjust this parameter to your preferences
     #
     # e.g. make it 20 for longer than 8000 etc
     #      or make it scale to input size,
     #      either logarithmically, or a fixed ratio like 2.5%
     #
     reverse last ( N = 4 ) characters/bytes of the input  

     if that **DOES NOT** match first N chars/bytes {

         return boolean-false # early exit
                              # no point to reverse rest of it
                              # when head and tail don't even match
     } else {
         
         if N was substantial

             trim out the head and tail of the input
             you've confirmed; avoid duplicated work

             remember to also update the variable(s)
             you've elected to track the input size  

     }

     [optional 1 : if that substring of N characters you've 
                   just checked happened to all contain the
                   same character, let's call it C,
                    
                   then locate the index position, P, for the first               
                   character that isn't C
                   
                   if P == input-size 

                       then you've already proven
                       the entire string is a nonstop repeat 
                       of one single character, which, by def, 
                       must be a palindrome

                       then just return Boolean-True


                   but the P is more than the half-way point,
                       you've also proven it cannot possibly be a 
                       palindrome, because second half contains a              
                       component that doesn't exist in first half,
                       

                       then just return Boolean-False ]


     [optional 2 : for extremely long inputs, 
                   like over 200,000 chars,
                   take the N chars right at the middle of it,
                   and see if the reversed one matches
                 
                   if that fails, then do early exit and save time ]

 }

 if all pre-checks passed,
 then simply do it BAU style :

     reverse second-half of it, 
     and see if it's same as first half

On top of what others said, I'd also add a few pre-check criteria for really large inputs :

quick check whether tail-character matches 
                    head character 

if NOT, just early exit by returning Boolean-False

if (input-length < 4) { 

    # The quick check just now already confirmed it's palindrome 

    return Boolean-True  

} else if (200 < input-length) {
     
     # adjust this parameter to your preferences
     #
     # e.g. make it 20 for longer than 8000 etc
     #      or make it scale to input size,
     #      either logarithmically, or a fixed ratio like 2.5%
     #
     reverse last ( N = 4 ) characters/bytes of the input  

     if that **DOES NOT** match first N chars/bytes {

         return boolean-false # early exit
                              # no point to reverse rest of it
                              # when head and tail don't even match
     } else {
         
         if N was substantial

             trim out the head and tail of the input
             you've confirmed; avoid duplicated work

             remember to also update the variable(s)
             you've elected to track the input size  

     }

     [optional 1 : if that substring of N characters you've 
                   just checked happened to all contain the
                   same character, let's call it C,
                    
                   then locate the index position, P, for the first               
                   character that isn't C
                   
                   if P == input-size 

                       then you've already proven
                       the entire string is a nonstop repeat 
                       of one single character, which, by def, 
                       must be a palindrome

                       then just return Boolean-True


                   but the P is more than the half-way point,
                       you've also proven it cannot possibly be a 
                       palindrome, because second half contains a              
                       component that doesn't exist in first half,
                       

                       then just return Boolean-False ]


     [optional 2 : for extremely long inputs, 
                   like over 200,000 chars,
                   take the N chars right at the middle of it,
                   and see if the reversed one matches
                 
                   if that fails, then do early exit and save time ]

 }

 if all pre-checks passed,
 then simply do it BAU style :

     reverse second-half of it, 
     and see if it's same as first half

回复收藏 0 原文

滥情空心 2024-07-15 06:36:37

使用Python，短代码可以更快，因为它将负载放入更快的VM内部（并且有整个缓存和其他类似的东西）

def ispalin(x):
   return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))

With Python, short code can be faster since it puts the load into the faster internals of the VM (And there is the whole cache and other such things)

def ispalin(x):
   return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))

回复收藏 0 原文

您的好友蓝忘机已上羡 2024-07-15 06:36:37

您可以使用哈希表来放置字符并拥有一个计数器变量，每当您找到不在表/映射中的元素时，该变量的值就会增加。如果您检查并找到表中已有的元素，则会减少计数。

For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.

See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
        char charA[]= s.toCharArray();
        for(int i=0;i<charA.length;i++)
        {

            if(!textMap.containsKey(charA[i]))
            {   
                textMap.put(charA[i], ++count);

            }
            else
                {
                textMap.put(charA[i],--count);


        }
        if(length%2 !=0)
        {
            if(count == 1)
            System.out.println("(odd case:PALINDROME)");
            else
                System.out.println("(odd case:not palindrome)");
        }
        else if(length%2==0)    
        {
            if(count ==0)
                System.out.println("(even case:palindrome)");
            else
                System.out.println("(even case :not palindrome)");
        }

You can use a hashtable to put the character and have a counter variable whose value increases everytime you find an element not in table/map. If u check and find element thats already in table decrease the count.

For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.

See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
        char charA[]= s.toCharArray();
        for(int i=0;i<charA.length;i++)
        {

            if(!textMap.containsKey(charA[i]))
            {   
                textMap.put(charA[i], ++count);

            }
            else
                {
                textMap.put(charA[i],--count);


        }
        if(length%2 !=0)
        {
            if(count == 1)
            System.out.println("(odd case:PALINDROME)");
            else
                System.out.println("(odd case:not palindrome)");
        }
        else if(length%2==0)    
        {
            if(count ==0)
                System.out.println("(even case:palindrome)");
            else
                System.out.println("(even case :not palindrome)");
        }

回复收藏 0 原文

南汐寒笙箫 2024-07-15 06:36:37

public class Palindrome{
    private static boolean isPalindrome(String s){
        if(s == null)
            return false;   //unitialized String ? return false
        if(s.isEmpty())     //Empty Strings is a Palindrome 
            return true;
        //we want check characters on opposite sides of the string 
        //and stop in the middle <divide and conquer>
        int left = 0;  //left-most char    
        int right = s.length() - 1; //right-most char

        while(left < right){  //this elegantly handles odd characters string 
            if(s.charAt(left) != s.charAt(right)) //left char must equal 
                return false; //right else its not a palindrome
            left++; //converge left to right 
            right--;//converge right to left 
        }
        return true; // return true if the while doesn't exit 
    }
}

虽然我们正在进行 n/2 次计算，但它仍然是 O(n)
这也可以使用线程来完成，但计算会变得混乱，最好避免它。这不会测试特殊字符并且区分大小写。我有可以做到这一点的代码，但是可以修改该代码以轻松处理该问题。

public class Palindrome{
    private static boolean isPalindrome(String s){
        if(s == null)
            return false;   //unitialized String ? return false
        if(s.isEmpty())     //Empty Strings is a Palindrome 
            return true;
        //we want check characters on opposite sides of the string 
        //and stop in the middle <divide and conquer>
        int left = 0;  //left-most char    
        int right = s.length() - 1; //right-most char

        while(left < right){  //this elegantly handles odd characters string 
            if(s.charAt(left) != s.charAt(right)) //left char must equal 
                return false; //right else its not a palindrome
            left++; //converge left to right 
            right--;//converge right to left 
        }
        return true; // return true if the while doesn't exit 
    }
}

though we are doing n/2 calculations its still O(n)
this can done also using threads, but calculations get messy, best to avoid it. this doesn't test for special characters and is case sensitive. I have code that does it, but this code can be modified to handle that easily.

回复收藏 0 原文

~没有更多了~