搜索允许字符串任意位置存在一个不匹配的字符串

发布于 2024-08-24 20:04:46 字数 913 浏览 14 评论 0原文

我正在处理长度为 25 的 DNA 序列(参见下面的示例)。我有一个 230,000 个列表,需要查找整个基因组(弓形虫寄生虫)中的每个序列。我不确定基因组有多大,但比 230,000 个序列长得多。

我需要查找每个 25 个字符的序列,例如 (AGCCTCCCATGATTGAACAGATCAT)。

基因组被格式化为连续的字符串,即(CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....

我不在乎在哪里或多少次找到它,只关心是否找到。
我认为这很简单——

str.find(AGCCTCCCATGATTGAACAGATCAT)

但我还想在任何位置找到定义为错误(不匹配)的紧密匹配,但只有一个位置,并在序列中记录该位置。我不知道该怎么做。我唯一能想到的是使用通配符并在每个位置使用通配符执行搜索。即搜索25次。

例如,

AGCCTCCCATGATTGAACAGATCAT    
AGCCTCCCATGATAGAACAGATCAT

在位置 13 处存在不匹配的势均力敌的比赛。

速度不是一个大问题,因为我只做了 3 次,但如果速度快的话那就太好了。

有一些程序可以执行此操作(查找匹配项和部分匹配项),但我正在寻找这些应用程序无法发现的一种部分匹配类型。

这是 Perl 的类似帖子,尽管它们只是比较序列而不是搜索连续字符串:

相关帖子

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000 sequences.

I need to look for each of my sequences of 25 characters, for example, (AGCCTCCCATGATTGAACAGATCAT).

The genome is formatted as a continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....)

I don't care where or how many times it is found, only whether it is or not.
This is simple I think --

str.find(AGCCTCCCATGATTGAACAGATCAT)

But I also what to find a close match defined as wrong (mismatched) at any location, but only one location, and record the location in the sequence. I am not sure how do do this. The only thing I can think of is using a wildcard and performing the search with a wildcard in each position. I.e., search 25 times.

For example,

AGCCTCCCATGATTGAACAGATCAT    
AGCCTCCCATGATAGAACAGATCAT

A close match with a mismatch at position 13.

Speed is not a big issue because I am only doing it 3 times, though it would be nice if it was fast.

There are programs that do this -- find matches and partial matches -- but I am looking for a type of partial match that is not discoverable with these applications.

Here is a similar post for perl, although they are only comparing sequences and not searching a continuous string :

Related post

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(14

旧人哭 2024-08-31 20:04:46

在继续阅读之前,您是否看过biopython

看来您想要找到具有一个替换错误和零插入/删除错误(即汉明距离为 1)的近似匹配。

如果您有汉明距离匹配函数(请参阅 Ignacio 提供的链接),您可以像这样使用它这是为了搜索第一个匹配项:

any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome)))

但这会相当慢,因为(1)汉明距离函数将在第二个替换错误后继续磨砺(2)失败后,它将光标前进一个而不是向前跳过基于它所看到的内容(就像博耶-摩尔搜索所做的那样)。

您可以使用如下函数来克服 (1):

def Hamming_check_0_or_1(genome, posn, sequence):
    errors = 0
    for i in xrange(25):
        if genome[posn+i] != sequence[i]:
            errors += 1
            if errors >= 2:
                return errors
    return errors 

注意:这不是 Pythonic,而是 Cic,因为您需要使用 C(可能通过 Cython)来获得合理的速度。

Navarro 和 Raffinot(谷歌“Navarro Raffinot nrgrep”)已经完成了一些关于带有跳跃的位并行近似 Levenshtein 搜索的工作,并且这可以适用于 Hamming 搜索。请注意,位并行方法对查询字符串的长度和字母大小有限制,但您的分别是 25 和 4,所以没有问题。更新:对于字母大小为 4 的情况,跳过可能没有太大帮助。

当您在 google 上搜索汉明距离搜索时,您会注意到很多有关在硬件中实现它的内容,而在软件中却没有太多帮助。这是一个很大的暗示,也许您想出的任何算法都应该用 C 或其他编译语言来实现。

更新: 位并行方法的工作代码

我还提供了一种简单的方法来帮助进行正确性检查,并且我打包了 Paul 的重新代码的变体进行一些比较。请注意,使用 re.finditer() 会提供不重叠的结果,这可能会导致距离 1 匹配隐藏精确匹配;请参阅我的最后一个测试用例。

位并行方法具有以下特征: 保证线性行为 O(N),其中 N 是文本长度。请注意,朴素方法的复杂度为 O(NM),正则表达式方法也是如此(M 是模式长度)。 Boyer-Moore 式方法的最坏情况为 O(NM),预期为 O(N)。当必须缓冲输入时,也可以轻松使用位并行方法:一次可以输入一个字节或一个兆字节;没有前瞻,缓冲区边界没有问题。最大的优点:用 C 语言编码时,简单的每输入字节代码的速度。缺点

:模式长度实际上受限于快速寄存器中的位数,例如 32 或 64。在这种情况下,模式长度为 25 ;没问题。它使用与模式中不同字符的数量成比例的额外内存(S_table)。在这种情况下,“字母大小”只有4;没问题。

详细信息来自此技术报告。该算法使用 Levenshtein 距离进行近似搜索。为了转换为使用汉明距离,我简单地(!)删除了语句 2.1 中处理插入和删除的部分。您会注意到很多对“R”的引用都带有“d”上标。 “d”是距离。我们只需要 0 和 1。这些“R”成为下面代码中的 R0 和 R1 变量。

# coding: ascii

from collections import defaultdict
import re

_DEBUG = 0


# "Fast Text Searching with Errors" by Sun Wu and Udi Manber
# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.
# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854

def WM_approx_Ham1_search(pattern, text):
    """Generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    S_table = defaultdict(int)
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    R0 = 0
    R1 = 0
    mask = 1 << (m - 1)
    for j, c in enumerate(text):
        S = S_table[c]
        shR0 = (R0 << 1) | 1
        R0 = shR0 & S
        R1 = ((R1 << 1) | 1) & S | shR0
        if _DEBUG:
            print "j= %2d msk=%s S=%s R0=%s R1=%s" \
                % tuple([j] + map(bitstr, [mask, S, R0, R1]))
        if R0 & mask: # exact match
            yield 0, j - m + 1
        elif R1 & mask: # match with one substitution
            yield 1, j - m + 1

if _DEBUG:

    def bitstr(num, mlen=8):
       wstr = ""
       for i in xrange(mlen):
          if num & 1:
             wstr = "1" + wstr
          else:
             wstr = "0" + wstr
          num >>= 1
       return wstr

def Ham_dist(s1, s2):
    """Calculate Hamming distance between 2 sequences."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def long_check(pattern, text):
    """Naively and understandably generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    for i in xrange(len(text) - m + 1):
        d = Ham_dist(pattern, text[i:i+m])
        if d < 2:
            yield d, i

def Paul_McGuire_regex(pattern, text):
    searchSeqREStr = (
        '('
        + pattern
        + ')|('
        + ')|('.join(
            pattern[:i]
            + "[ACTGN]".replace(c,'')
            + pattern[i+1:]
            for i,c in enumerate(pattern)
            )
        + ')'
        )
    searchSeqRE = re.compile(searchSeqREStr)
    for match in searchSeqRE.finditer(text):
        locn = match.start()
        dist = int(bool(match.lastindex - 1))
        yield dist, locn


if __name__ == "__main__":

    genome1 = "TTTACGTAAACTAAACTGTAA"
    #         01234567890123456789012345
    #                   1         2

    tests = [
        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),
        ("T" * 10, "TTTT"),
        ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match
        ]

    nfailed = 0
    for genome, patterns in tests:
        print "genome:", genome
        for pattern in patterns.split():
            print pattern
            a1 = list(WM_approx_Ham1_search(pattern, genome))
            a2 = list(long_check(pattern, genome))
            a3 = list(Paul_McGuire_regex(pattern, genome))
            print a1
            print a2
            print a3
            print a1 == a2, a2 == a3
            nfailed += (a1 != a2 or a2 != a3)
    print "***", nfailed

Before you read on, have you looked at biopython?

It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1.

If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:

any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome)))

but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (like a Boyer-Moore search does).

You can overcome (1) with a function like this:

def Hamming_check_0_or_1(genome, posn, sequence):
    errors = 0
    for i in xrange(25):
        if genome[posn+i] != sequence[i]:
            errors += 1
            if errors >= 2:
                return errors
    return errors 

Note: that's intentionally not Pythonic, it's Cic, because you'd need to use C (perhaps via Cython) to get reasonable speed.

Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.

When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.

Update: Working code for a bit-parallel method

I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.

The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.

Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.

Details from this technical report. The algorithm there is for approximate search usin Levenshtein distance. To convert to using Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of reference to "R" with a "d" superscript. "d" is distance. We need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.

# coding: ascii

from collections import defaultdict
import re

_DEBUG = 0


# "Fast Text Searching with Errors" by Sun Wu and Udi Manber
# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.
# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854

def WM_approx_Ham1_search(pattern, text):
    """Generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    S_table = defaultdict(int)
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    R0 = 0
    R1 = 0
    mask = 1 << (m - 1)
    for j, c in enumerate(text):
        S = S_table[c]
        shR0 = (R0 << 1) | 1
        R0 = shR0 & S
        R1 = ((R1 << 1) | 1) & S | shR0
        if _DEBUG:
            print "j= %2d msk=%s S=%s R0=%s R1=%s" \
                % tuple([j] + map(bitstr, [mask, S, R0, R1]))
        if R0 & mask: # exact match
            yield 0, j - m + 1
        elif R1 & mask: # match with one substitution
            yield 1, j - m + 1

if _DEBUG:

    def bitstr(num, mlen=8):
       wstr = ""
       for i in xrange(mlen):
          if num & 1:
             wstr = "1" + wstr
          else:
             wstr = "0" + wstr
          num >>= 1
       return wstr

def Ham_dist(s1, s2):
    """Calculate Hamming distance between 2 sequences."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def long_check(pattern, text):
    """Naively and understandably generate (Hamming_dist, start_offset)
    for matches with distance 0 or 1"""
    m = len(pattern)
    for i in xrange(len(text) - m + 1):
        d = Ham_dist(pattern, text[i:i+m])
        if d < 2:
            yield d, i

def Paul_McGuire_regex(pattern, text):
    searchSeqREStr = (
        '('
        + pattern
        + ')|('
        + ')|('.join(
            pattern[:i]
            + "[ACTGN]".replace(c,'')
            + pattern[i+1:]
            for i,c in enumerate(pattern)
            )
        + ')'
        )
    searchSeqRE = re.compile(searchSeqREStr)
    for match in searchSeqRE.finditer(text):
        locn = match.start()
        dist = int(bool(match.lastindex - 1))
        yield dist, locn


if __name__ == "__main__":

    genome1 = "TTTACGTAAACTAAACTGTAA"
    #         01234567890123456789012345
    #                   1         2

    tests = [
        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),
        ("T" * 10, "TTTT"),
        ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match
        ]

    nfailed = 0
    for genome, patterns in tests:
        print "genome:", genome
        for pattern in patterns.split():
            print pattern
            a1 = list(WM_approx_Ham1_search(pattern, genome))
            a2 = list(long_check(pattern, genome))
            a3 = list(Paul_McGuire_regex(pattern, genome))
            print a1
            print a2
            print a3
            print a1 == a2, a2 == a3
            nfailed += (a1 != a2 or a2 != a3)
    print "***", nfailed
知你几分 2024-08-31 20:04:46

Python 正则表达式 库支持 模糊正则表达式匹配。相对于 TRE 的一个优点是它允许查找文本中正则表达式的所有匹配项(支持重叠匹配,如出色地)。

import regex
m=regex.findall("AA", "CAG")
>>> []
m=regex.findall("(AA){e<=1}", "CAAG") # means allow up to 1 error
m
>>> ['CA', 'AG']

Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it allows to find all matches of regular expression in the text (supports overlapping matches as well).

import regex
m=regex.findall("AA", "CAG")
>>> []
m=regex.findall("(AA){e<=1}", "CAAG") # means allow up to 1 error
m
>>> ['CA', 'AG']
没︽人懂的悲伤 2024-08-31 20:04:46

我在谷歌上搜索“弓形虫寄生虫基因组”,在网上找到了一些基因组文件。我在 http://toxodb.org 找到了一个我认为很接近的文件,名为“TgondiiGenomic_ToxoDB-6.0.fasta”,大小约158Mb。我使用下面的 pyparsing 表达式来提取基因序列,只花了不到 2 分钟:(

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read()   # yes! just read the whole dang 158Mb!

"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                "length=" + integer("genelen") + LineEnd() + 
                Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))

# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)

惊讶!一些基因序列包含 'N' 的运行!那到底是什么?!)

然后我将这个类编写为子类pyparsing Token 类的,用于进行紧密匹配:

class CloseMatch(Token):
    def __init__(self, seq, maxMismatches=1):
        super(CloseMatch,self).__init__()
        self.name = seq
        self.sequence = seq
        self.maxMismatches = maxMismatches
        self.errmsg = "Expected " + self.sequence
        self.mayIndexError = False
        self.mayReturnEmpty = False

    def parseImpl( self, instring, loc, doActions=True ):
        start = loc
        instrlen = len(instring)
        maxloc = start + len(self.sequence)

        if maxloc <= instrlen:
            seq = self.sequence
            seqloc = 0
            mismatches = []
            throwException = False
            done = False
            while loc < maxloc and not done:
                if instring[loc] != seq[seqloc]:
                    mismatches.append(seqloc)
                    if len(mismatches) > self.maxMismatches:
                        throwException = True
                        done = True
                loc += 1
                seqloc += 1
        else:
            throwException = True

        if throwException:
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

        return loc, (instring[start:loc],mismatches)

对于每个匹配,这将返回一个包含匹配的实际字符串的元组以及不匹配位置的列表。精确匹配当然会为第二个值返回一个空列表。 (我喜欢这个类,我想我会将它添加到 pyparsing 的下一个版本中。)

然后我运行此代码以在从 .fasta 文件读取的所有序列中搜索“最多 2 个不匹配”匹配项(回想一下,genedata 是 ParseResults 组的序列,每个组包含一个 id、一个整数长度和一个序列字符串):

searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for t,startLoc,endLoc in searchseq.scanString(g.gene):
        matched, mismatches = t[0]
        print "MATCH:", searchseq.sequence
        print "FOUND:", matched
        if mismatches:
            print "      ", ''.join(' ' if i not in mismatches else '*' 
                            for i,c in enumerate(searchseq.sequence))
        else:
            print "<exact match>"
        print "at location", startLoc
        print
    print

我从一个基因位中随机获取搜索序列,以确保我可以找到精确匹配,只是出于好奇想看看有多少 1 和 2 元素不匹配。

这需要一点时间才能运行。 45 分钟后,我得到了这个输出,列出了每个 id 和基因长度,以及找到的任何部分匹配:

scf_1104442825154 (964)
------------------------

scf_1104442822828 (942)
------------------------

scf_1104442824510 (987)
------------------------

scf_1104442823180 (1065)
------------------------
...

我很沮丧,没有看到任何匹配,直到:

scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
                *      *        
at location 33

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 175

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 474

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 617

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
                       *   *    
at location 718

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
                    *  *        
at location 896

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
                       *     *  
at location 945

最后我的精确匹配为:

scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
                    *  *        
at location 177

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
                       *        
at location 203

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 350

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
                       *       *
at location 523

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 822

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
          *         *           
at location 969

所以虽然这没有设置任何速度记录后,我完成了工作,还发现了一些 2 场比赛,以防万一他们感兴趣。

为了进行比较,这里是一个基于 RE 的版本,仅查找 1-不匹配的匹配项:(

import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
             for i,c in enumerate(seqStr))

searchSeqRE = re.compile(searchSeqREStr)

for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for match in searchSeqRE.finditer(g.gene):
        print "MATCH:", seqStr
        print "FOUND:", match.group(0)
        print "at location", match.start()
        print
    print

首先,我尝试搜索原始 FASTA 文件源本身,但很困惑为什么与 pyparsing 版本相比匹配项如此之少。然后我意识到一些匹配必须跨越换行符,因为 fasta 文件输出被包裹在 n 个字符处。)

因此,在第一次 pyparsing 传递以提取要匹配的基因序列之后,这个基于 RE 的搜索器又花费了大约 1-1/ 2 分钟扫描所有未文本包装的序列,以查找 pyparsing 解决方案所做的所有相同的 1 不匹配条目。

I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158Mb in size. I used the following pyparsing expression to extract the gene sequences, it took just under 2 minutes:

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read()   # yes! just read the whole dang 158Mb!

"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                "length=" + integer("genelen") + LineEnd() + 
                Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))

# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)

(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)

Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

class CloseMatch(Token):
    def __init__(self, seq, maxMismatches=1):
        super(CloseMatch,self).__init__()
        self.name = seq
        self.sequence = seq
        self.maxMismatches = maxMismatches
        self.errmsg = "Expected " + self.sequence
        self.mayIndexError = False
        self.mayReturnEmpty = False

    def parseImpl( self, instring, loc, doActions=True ):
        start = loc
        instrlen = len(instring)
        maxloc = start + len(self.sequence)

        if maxloc <= instrlen:
            seq = self.sequence
            seqloc = 0
            mismatches = []
            throwException = False
            done = False
            while loc < maxloc and not done:
                if instring[loc] != seq[seqloc]:
                    mismatches.append(seqloc)
                    if len(mismatches) > self.maxMismatches:
                        throwException = True
                        done = True
                loc += 1
                seqloc += 1
        else:
            throwException = True

        if throwException:
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

        return loc, (instring[start:loc],mismatches)

For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)

I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for t,startLoc,endLoc in searchseq.scanString(g.gene):
        matched, mismatches = t[0]
        print "MATCH:", searchseq.sequence
        print "FOUND:", matched
        if mismatches:
            print "      ", ''.join(' ' if i not in mismatches else '*' 
                            for i,c in enumerate(searchseq.sequence))
        else:
            print "<exact match>"
        print "at location", startLoc
        print
    print

I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

scf_1104442825154 (964)
------------------------

scf_1104442822828 (942)
------------------------

scf_1104442824510 (987)
------------------------

scf_1104442823180 (1065)
------------------------
...

I was getting discouraged, not to see any matches until:

scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
                *      *        
at location 33

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 175

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 474

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
                       *        
at location 617

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
                       *   *    
at location 718

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
                    *  *        
at location 896

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
                       *     *  
at location 945

And finally my exact match at:

scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
                    *  *        
at location 177

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
                       *        
at location 203

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 350

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
                       *       *
at location 523

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
             *         *        
at location 822

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848

MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
          *         *           
at location 969

So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.

For comparison, here is an RE-based version, that finds 1-mismatch matches only:

import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
             for i,c in enumerate(seqStr))

searchSeqRE = re.compile(searchSeqREStr)

for g in genedata:
    print "%s (%d)" % (g.id, g.genelen)
    print "-"*24
    for match in searchSeqRE.finditer(g.gene):
        print "MATCH:", seqStr
        print "FOUND:", match.group(0)
        print "at location", match.start()
        print
    print

(At first, I tried searching the raw FASTA file source itself, but was puzzled why so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the fasta file output is wrapped at n characters.)

So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.

北风几吹夏 2024-08-31 20:04:46

您可能会在 Python-Levenshtein 中找到一些有用的例程。

You might find the various routines in Python-Levenshtein of some use.

我一向站在原地 2024-08-31 20:04:46
>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))

>>> seq_re.findall(genome)  # list of matches
[]  

>>> seq_re.search(genome) # None if not found, otherwise a match object

这个会停止第一个匹配,因此当长度为 10,000,000 的基因组有多个匹配时,速度可能会快一点,

>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found

>>> print "found" if seq_re.search(genome) else "not found" 
not found

>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found

您正在查看单个线程扫描 230,000 个序列大约需要 2.5 天,因此您可能需要拆分任务到几个核心/CPU 上。

您始终可以在该算法运行时开始实现更有效的算法:)

如果您希望搜索单个删除或添加的元素,请将正则表达式更改为此

>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))
>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))

>>> seq_re.findall(genome)  # list of matches
[]  

>>> seq_re.search(genome) # None if not found, otherwise a match object

This one stops a the first match, so may be a bit faster when there are multiple matches

>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found

>>> print "found" if seq_re.search(genome) else "not found" 
not found

>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found

for a genome of length 10,000,000 you are looking at about 2.5 days for a single thread to scan 230,000 sequences, so you may want to split up the task onto a few cores/cpus.

You can always start implementing a more efficient algorithm while this one is running :)

If you should wish to search for single dropped or added elements change the regexp to this

>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))
木森分化 2024-08-31 20:04:46

这暗示了最长公共子序列问题。这里字符串相似度的问题是你需要针对 230000 个序列的连续字符串进行测试;因此,如果您将 25 个序列之一与连续字符串进行比较,您将得到非常低的相似度。

如果计算 25 个序列和连续字符串之间的最长公共子序列,如果长度相同,您就会知道它是否在字符串中。

This hints of the longest common subsequence problem. The problem with string similarity here is that you need to test against a continuous string of 230000 sequences; so if you are comparing one of your 25 sequences to the continuous string you'll get a very low similarity.

If you compute the longest common subsequence between your 25 sequences and the continuous string, you'll know if it is in the string if the lengths are the same.

时间你老了 2024-08-31 20:04:46

您可以从要匹配的所有不同序列中创建一个 trie 。现在是在 trie 中创建最多允许一个不匹配的深度优先搜索函数的棘手部分。

此方法的优点是您可以一次搜索所有序列。这将为您节省大量比较。例如,当您从顶部节点开始并沿着“A”分支向下移动时,您刚刚为自己节省了数千次比较,因为立即与所有以“A”开头的序列匹配。对于不同的论点,考虑与给定序列完全匹配的基因组切片。如果序列列表中有一个不同的序列,仅最后一个符号不同,那么使用 trie 就可以节省 23 次比较操作。

这是实现此目的的一种方法。将 'A'、'C'、T'、G' 转换为 0,1,2,3 或其变体。然后使用元组的元组作为特里树的结构。在每个节点,数组中的第一个元素对应于“A”,第二个元素对应于“C”,依此类推。如果“A”是该节点的分支,则还有另一个 4 个元素的元组作为该节点元组的第一项。如果没有“A”分支,则将第一项设置为 0。 trie 的底部是具有该序列 id 的节点,以便可以将其放入匹配列表中。

以下是允许此类 trie 出现不匹配的递归搜索函数:

def searchnomismatch(node,genome,i):
    if i == 25:
        addtomatches(node)
    else:
        for x in range(4):
            if node[x]:
                if x == genome[i]:
                    searchnomismatch(node[x],genome,i+1)
                else:
                    searchmismatch(node[x],genome,i+1,i)

def searchmismatch(node,genome,i,where):
    if i == 25:
        addtomatches(node,where)
    else:
        if node[genome[i]]:
            searchmismatch(node[genome[i]],genome,i+1,where)

在这里,我以类似的内容开始搜索

searchnomismatch(trie,genome[ind:ind+25],0)

,addtomatches 类似于

def addtomatches(id,where=-1):
    matches.append(id,where)

等于 -1 的情况表示不存在不匹配。不管怎样,我希望我说得足够清楚,这样你就能明白了。

You could make a trie out of all the different sequences that you want to match. Now is the tricky part of making a depth first search function down the trie that allows at most one mismatch.

The advantage of this method is that you are searching through all of the sequences at once. This will save you a lot of comparisons. For instance, when you start at the top node and go down the 'A' branch, you have just saved yourself many thousands of comparisons because have just instantly matched with all sequences that start with 'A'. For a different argument, consider a slice of the genome that matches exactly with a given sequence. If you have a different sequence in your list of sequences that differs only in the last symbol, then using a trie has just saved you 23 comparison operations.

Here is one way of implementing this. Convert 'A','C',T',G' to 0,1,2,3 or a variant of that. Then use tuples of tuples as your structure for your trie. At each node, the first element in your array corresponds with 'A', the second with 'C' and so on. If 'A' is a branch of this node, then there is another tuple of 4 elements as the first item of this node's tuple. If there isn't an 'A' branch, then set the first item to 0. At the bottom of the trie are nodes that have the id of that sequence so that it can be put into the list of matches.

Here are recursive search functions allowing one mismatch for this sort of trie:

def searchnomismatch(node,genome,i):
    if i == 25:
        addtomatches(node)
    else:
        for x in range(4):
            if node[x]:
                if x == genome[i]:
                    searchnomismatch(node[x],genome,i+1)
                else:
                    searchmismatch(node[x],genome,i+1,i)

def searchmismatch(node,genome,i,where):
    if i == 25:
        addtomatches(node,where)
    else:
        if node[genome[i]]:
            searchmismatch(node[genome[i]],genome,i+1,where)

Here, I start out the search with something like

searchnomismatch(trie,genome[ind:ind+25],0)

and addtomatches is something similar to

def addtomatches(id,where=-1):
    matches.append(id,where)

where equal to -1 means there wasn't a mismatch. Anyway, I hope that I was clear enough so that you get the picture.

树深时见影 2024-08-31 20:04:46

我尝试了一些解决方案,但正如已经写的那样,它们在处理大量序列(字符串)时速度很慢。

我想出了使用 bowtie 并将感兴趣的子字符串(soi)映射到包含以下内容的参考文件FASTA 格式的字符串。您可以提供允许的不匹配数 (0..3),然后根据允许的不匹配数返回 soi 映射到的字符串。这效果很好而且相当快。

I tried some of the solutions, but as already written they are slow when dealing with a large amount of sequences (strings).

I came up with using bowtie and mapping the substring of interest (soi) against a reference file which contains the strings in FASTA format. You can provide the number of allowed mismatches (0..3) and you get back the strings to which the soi mapped given the allowed mismatches. This works well and pretty fast.

メ斷腸人バ 2024-08-31 20:04:46

我也遇到了这个问题,但是建议的 regex 包的行为不符合预期(正则表达式中带有“e”),特别是对于查询和搜索序列中的插入/删除。例如,无法找到匹配项:regex.search("xxgyy" + "{e<=1}", "xxggyy")

我终于找到了一个可以用的包! fuzzysearch 包中的 find_near_matches()https://github.com/taleinat/fuzzysearch

它利用 Levenshtein 距离 (max_l_dist)。

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

如果你想获取返回值:

>>> x= find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
## get the start position
>>> x[0].start
3
## get the matched string
>>> x[0].matched
'PATERN'

I came across this problem too, but the suggested regex package does not behave as expected (with the "e" in their regular expression), especially for insertion/deletion in the query and search sequence. For example, this failed to find match: regex.search("xxgyy" + "{e<=1}", "xxggyy").

I finally found a package that works! The find_near_matches() in the fuzzysearch package! https://github.com/taleinat/fuzzysearch

It utilizes the Levenshtein distance (max_l_dist).

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

if you want to get the return value:

>>> x= find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
## get the start position
>>> x[0].start
3
## get the matched string
>>> x[0].matched
'PATERN'
风情万种。 2024-08-31 20:04:46

您可以使用 Python 内置功能通过正则表达式匹配进行搜索。

python 中的 re 模块
http://docs.python.org/library/re.html

正则表达式入门
http://www.regular-expressions.info/

You could use Pythons built in capability to do the search with regular expression matching.

re module in python
http://docs.python.org/library/re.html

regular expression primer
http://www.regular-expressions.info/

失而复得 2024-08-31 20:04:46

我想这可能来得有点晚了,但是有一个名为 PatMaN 的工具可以完全满足您的需求:
http://bioinf.eva.mpg.de/patman/

I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want:
http://bioinf.eva.mpg.de/patman/

枯叶蝶 2024-08-31 20:04:46

您可以使用正则表达式匹配库 TRE 进行“近似匹配”。它还具有 Python、Perl 和 Haskell 的绑定。

import tre

pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""

fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)

if m:
    print m.groups()
    print m[0]

这将输出

tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth

http://en.wikipedia.org/wiki/TRE_%28computing%29< /a>

http://laurikari.net/tre/

You can use regex matching library TRE, for "approximate matching". It also has bindings for Python, Perl and Haskell.

import tre

pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""

fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)

if m:
    print m.groups()
    print m[0]

which will output

tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth

http://en.wikipedia.org/wiki/TRE_%28computing%29

http://laurikari.net/tre/

命硬 2024-08-31 20:04:46

我认为下面的代码既简单又方便。

in_pattern = "";
in_genome = "";
in_mistake = d;
out_result = ""


kmer = len(in_pattern)

def FindMistake(v):
    mistake = 0
    for i in range(0, kmer, 1):
        if (v[i]!=in_pattern[i]):
            mistake+=1
        if mistake>in_mistake:
            return False
    return True


for i in xrange(len(in_genome)-kmer+1):
    v = in_genome[i:i+kmer]
    if FindMistake(v):
        out_result+= str(i) + " "

print out_result

您可以轻松插入想要检查的基因组和片段,还可以设置不匹配的值。

I thought the code below is simple and convenient.

in_pattern = "";
in_genome = "";
in_mistake = d;
out_result = ""


kmer = len(in_pattern)

def FindMistake(v):
    mistake = 0
    for i in range(0, kmer, 1):
        if (v[i]!=in_pattern[i]):
            mistake+=1
        if mistake>in_mistake:
            return False
    return True


for i in xrange(len(in_genome)-kmer+1):
    v = in_genome[i:i+kmer]
    if FindMistake(v):
        out_result+= str(i) + " "

print out_result

You can easily insert the genomes and segments you want to check and also set up the value of mismatch.

最美的太阳 2024-08-31 20:04:46

这是相当古老的,但也许这个简单的解决方案可以工作。
循环遍历包含 25 个字符片段的序列。将切片转换为 numpy 数组。与 25char 字符串(也作为 numpy 数组)进行比较。对答案求和,如果答案是 24,则打印出循环中的位置和不匹配的位置。

接下来的几行显示它正在工作

<块引用>
<块引用>

将 numpy 导入为 np

a = ['A','B','C']

b = np.array(a)

b


array(['A', 'B', 'C'], dtype='

<块引用>
<块引用>

c = ['A','D','C']

d = np.array(c)

b==d


array([ True, False, True])

<块引用>
<块引用>

求和(b==d)


2

This is quite old but perhaps this simple solution could work.
loop through the sequence taking 25character slices. convert the slice to an numpy array. Compare to the 25char string (also as a numpy array). Sum the answer and if the answer is 24 print out the position in the loop and the mismatch.

te next few lines show it working

import numpy as np

a = ['A','B','C']

b = np.array(a)

b

array(['A', 'B', 'C'], dtype='

c = ['A','D','C']

d = np.array(c)

b==d

array([ True, False, True])

sum(b==d)

2

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文