Search for a string allowing for one mismatch in any location of the string
I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000 sequences.
I need to look for each of my sequences of 25 characters, for example, (AGCCTCCCATGATTGAACAGATCAT).
The genome is formatted as a continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....)
I don't care where or how many times it is found, only whether it is or not.
This is simple I think --
str.find("AGCCTCCCATGATTGAACAGATCAT")
But I also want to find a close match, defined as wrong (mismatched) at any location, but only one location, and record the location in the sequence. I am not sure how to do this. The only thing I can think of is using a wildcard and performing the search with a wildcard in each position, i.e., searching 25 times.
For example,
AGCCTCCCATGATTGAACAGATCAT
AGCCTCCCATGATAGAACAGATCAT
A close match with a mismatch at position 13.
Speed is not a big issue because I am only doing it 3 times, though it would be nice if it was fast.
There are programs that do this -- find matches and partial matches -- but I am looking for a type of partial match that is not discoverable with these applications.
Here is a similar post for perl, although they are only comparing sequences and not searching a continuous string :
Before you read on, have you looked at biopython?
It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1.
If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:
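The answer's code block was lost in extraction; a minimal sketch of the kind of loop being described, with a straightforward `hamming_distance` helper (the names here are mine, not the answer's):

```python
def hamming_distance(s1, s2):
    # number of positions at which corresponding characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def find_first_approx(pattern, text, max_dist=1):
    # slide the pattern along the text; return the offset of the
    # first window within max_dist substitutions, or -1 if none
    m = len(pattern)
    for i in range(len(text) - m + 1):
        if hamming_distance(pattern, text[i:i + m]) <= max_dist:
            return i
    return -1
```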
but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error, and (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (as a Boyer-Moore search does).
You can overcome (1) with a function like this:
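The function itself didn't survive extraction; a sketch of what "overcoming (1)" looks like — bail out as soon as a second mismatch is seen (my naming, not the answer's):

```python
def mismatches_capped(pattern, text, offset, cap=1):
    # Count mismatches between pattern and text[offset:offset+len(pattern)],
    # but give up as soon as the cap is exceeded, instead of
    # grinding through the rest of the window.
    count = 0
    for i, pc in enumerate(pattern):
        if text[offset + i] != pc:
            count += 1
            if count > cap:
                return count  # early exit: already too many errors
    return count
```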
Note: that's intentionally not Pythonic, it's Cic, because you'd need to use C (perhaps via Cython) to get reasonable speed.
Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.
When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.
Update: Working code for a bit-parallel method
I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.
The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.
Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.
Details are in this technical report. The algorithm there is for approximate search using Levenshtein distance. To convert it to Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of references to "R" with a "d" superscript; "d" is distance, and we need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.
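The working code referred to above was lost in extraction; the following is a reconstruction of the R0/R1 recurrence just described (Shift-And with one substitution allowed), not necessarily the answer's exact code:

```python
from collections import defaultdict

def hamming1_search(pattern, text):
    # Bit-parallel (Shift-And style) scan for all occurrences of
    # `pattern` in `text` within Hamming distance 1.
    # R0 tracks prefixes matched exactly; R1 tracks prefixes matched
    # with at most one substitution. Yields (start_offset, distance).
    m = len(pattern)
    S_table = defaultdict(int)   # bitmask of pattern positions per character
    for i, c in enumerate(pattern):
        S_table[c] |= 1 << i
    match_bit = 1 << (m - 1)
    R0 = R1 = 0
    for j, c in enumerate(text):
        S = S_table[c]
        R0_new = ((R0 << 1) | 1) & S            # extend exact prefixes
        R1 = ((R1 << 1) & S) | (R0 << 1) | 1    # extend, or spend the one substitution
        R0 = R0_new
        if R0 & match_bit:
            yield j - m + 1, 0
        elif R1 & match_bit:
            yield j - m + 1, 1
```

Note the per-character work is a handful of shifts, ANDs and ORs, which is why this codes up so well in C.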
Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it allows to find all matches of regular expression in the text (supports overlapping matches as well).
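For instance (a sketch of my own; it assumes the third-party regex package is installed, and plants a one-substitution variant of the question's 25-mer):

```python
import regex  # third-party fuzzy-matching successor to re

pattern = "AGCCTCCCATGATTGAACAGATCAT"
# genome fragment with one substitution planted at position 13 of the pattern
text = "CCC" + "AGCCTCCCATGATAGAACAGATCAT" + "GGG"

# {s<=1} allows at most one substituted character; overlapped=True
# keeps overlapping hits that non-overlapping scanning would drop
hits = regex.findall("(?:%s){s<=1}" % pattern, text, overlapped=True)
```

The `{s<=1}` constraint permits substitutions only, which matches the question's Hamming-distance-1 requirement; `{e<=1}` would also allow one insertion or deletion.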
I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158Mb in size. I used the following pyparsing expression to extract the gene sequences, it took just under 2 minutes:
(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)
Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:
For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)
I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):
I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.
This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:
I was getting discouraged, not to see any matches until:
And finally my exact match at:
So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.
For comparison, here is an RE-based version, that finds 1-mismatch matches only:
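The code itself is missing above; the general shape of such a searcher — one alternative per wildcard position, exactly the OP's "search 25 times" idea rolled into a single regex — might look like this (a sketch, not Paul's original):

```python
import re

def one_mismatch_regex(seq):
    # Build an alternation with '.' substituted at each position:
    # matches seq exactly, or with one substituted character.
    alts = [seq[:i] + "." + seq[i + 1:] for i in range(len(seq))]
    return re.compile("|".join(alts))

# usage: compile once per 25-mer, then scan the unwrapped sequence
rx = one_mismatch_regex("AGCCTCCCATGATTGAACAGATCAT")
```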
(At first, I tried searching the raw FASTA file source itself, but was puzzled why so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the fasta file output is wrapped at n characters.)
So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.
You might find the various routines in Python-Levenshtein of some use.
This one stops at the first match, so it may be a bit faster when there are multiple matches.
For a genome of length 10,000,000, you are looking at about 2.5 days for a single thread to scan 230,000 sequences, so you may want to split the task across a few cores/CPUs.
You can always start implementing a more efficient algorithm while this one is running :)
If you should wish to search for single dropped or added elements change the regexp to this
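The answer's actual regexp wasn't preserved above; patterns of the kind it describes can be generated like this (a hedged sketch, names mine):

```python
import re

def one_deletion_regex(seq):
    # Matches seq with exactly one character dropped.
    alts = [seq[:i] + seq[i + 1:] for i in range(len(seq))]
    return re.compile("|".join(alts))

def one_insertion_regex(seq):
    # Matches seq with one extra character inserted at any position.
    alts = [seq[:i] + "." + seq[i:] for i in range(len(seq) + 1)]
    return re.compile("|".join(alts))
```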
This hints of the longest common subsequence problem. The problem with string similarity here is that you need to test against a continuous string of 230000 sequences; so if you are comparing one of your 25 sequences to the continuous string you'll get a very low similarity.
If you compute the longest common subsequence between your 25 sequences and the continuous string, you'll know if it is in the string if the lengths are the same.
You could make a trie out of all the different sequences that you want to match. Now is the tricky part of making a depth first search function down the trie that allows at most one mismatch.
The advantage of this method is that you are searching through all of the sequences at once. This will save you a lot of comparisons. For instance, when you start at the top node and go down the 'A' branch, you have just saved yourself many thousands of comparisons, because you have instantly matched against all sequences that start with 'A'. For a different argument, consider a slice of the genome that matches exactly with a given sequence. If you have a different sequence in your list of sequences that differs only in the last symbol, then using a trie has just saved you 23 comparison operations.
Here is one way of implementing this. Convert 'A', 'C', 'T', 'G' to 0, 1, 2, 3 or a variant of that. Then use tuples of tuples as the structure for your trie. At each node, the first element in your array corresponds to 'A', the second to 'C', and so on. If 'A' is a branch of this node, then there is another tuple of 4 elements as the first item of this node's tuple. If there isn't an 'A' branch, then set the first item to 0. At the bottom of the trie are nodes that have the id of that sequence, so that it can be put into the list of matches.
Here are recursive search functions allowing one mismatch for this sort of trie:
Here, I start out the search with something like
and addtomatches is something similar to
where equal to -1 means there wasn't a mismatch. Anyway, I hope that I was clear enough so that you get the picture.
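The recursive functions themselves didn't make it into the text above; a compact dict-based sketch of the same idea (my own naming, using -1 for "no mismatch" as the answer describes):

```python
def build_trie(sequences):
    # Map each sequence (e.g. a 25-mer) into a nested-dict trie;
    # each leaf stores the sequence's index under the key None.
    root = {}
    for idx, seq in enumerate(sequences):
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node[None] = idx
    return root

def search_trie(node, window, depth=0, mismatch_pos=-1, matches=None):
    # DFS down the trie against one genome window, allowing at most
    # one mismatch; collects (sequence_index, mismatch_pos) pairs,
    # where mismatch_pos == -1 means an exact match.
    if matches is None:
        matches = []
    if None in node:
        matches.append((node[None], mismatch_pos))
    if depth == len(window):
        return matches
    ch = window[depth]
    for branch, child in node.items():
        if branch is None:
            continue
        if branch == ch:
            search_trie(child, window, depth + 1, mismatch_pos, matches)
        elif mismatch_pos == -1:  # spend the single allowed mismatch here
            search_trie(child, window, depth + 1, depth, matches)
    return matches
```

To scan a genome, call `search_trie` on every 25-character window; all 230,000 sequences are checked in one descent.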
I tried some of the solutions, but as already written they are slow when dealing with a large amount of sequences (strings).
I came up with using bowtie and mapping the substring of interest (soi) against a reference file which contains the strings in FASTA format. You can provide the number of allowed mismatches (0..3) and you get back the strings to which the soi mapped given the allowed mismatches. This works well and pretty fast.
I came across this problem too, but the suggested regex package does not behave as expected (with the "e" in their regular expression), especially for insertion/deletion in the query and search sequence. For example, this failed to find a match: regex.search("xxgyy" + "{e<=1}", "xxggyy").
I finally found a package that works: find_near_matches() in the fuzzysearch package! https://github.com/taleinat/fuzzysearch It utilizes the Levenshtein distance (max_l_dist).
If you want to get the return value:
You could use Python's built-in capability to do the search with regular expression matching.
re module in python
http://docs.python.org/library/re.html
regular expression primer
http://www.regular-expressions.info/
I guess this may come a bit late, but there is a tool named PatMaN that does exactly what you want:
http://bioinf.eva.mpg.de/patman/
You can use regex matching library TRE, for "approximate matching". It also has bindings for Python, Perl and Haskell.
which will output
http://en.wikipedia.org/wiki/TRE_%28computing%29
http://laurikari.net/tre/
I thought the code below is simple and convenient.
You can easily insert the genomes and segments you want to check and also set up the value of mismatch.
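The answer's original code block was lost in extraction; a minimal sketch of the kind of loop it describes (names mine):

```python
def find_with_mismatches(genome, segment, max_mismatch=1):
    # Slide `segment` along `genome`; report every window position
    # whose mismatch count is within `max_mismatch`, together with
    # the list of mismatching indices inside the window.
    hits = []
    m = len(segment)
    for pos in range(len(genome) - m + 1):
        window = genome[pos:pos + m]
        mism = [i for i in range(m) if window[i] != segment[i]]
        if len(mism) <= max_mismatch:
            hits.append((pos, mism))
    return hits
```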
This is quite old but perhaps this simple solution could work.
Loop through the sequence taking 25-character slices. Convert each slice to a numpy array. Compare it to the 25-char string (also as a numpy array). Sum the result, and if the sum is 24, print the position in the loop and the position of the mismatch.
The next few lines show it working:
array(['A', 'B', 'C'], dtype='<U1')
array([ True, False,  True])
2
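Put together as a function, the approach might look like this (a sketch under my own naming; a sum of 24 over a 25-char window means exactly one mismatch):

```python
import numpy as np

def numpy_scan(genome, probe):
    # Slide a window of len(probe) along the genome, compare as
    # numpy character arrays, and collect windows with exactly
    # one mismatching position: (window_start, mismatch_index).
    p = np.array(list(probe))
    n = len(probe)
    hits = []
    for pos in range(len(genome) - n + 1):
        eq = np.array(list(genome[pos:pos + n])) == p
        if eq.sum() == n - 1:        # n-1 matches => one mismatch
            hits.append((pos, int(np.flatnonzero(~eq)[0])))
    return hits
```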