Counting each pair of consecutive characters in a string in Perl
Is there a way in Perl (without BioPerl) to count the occurrences of each pair of consecutive letters?
I.e., the number of AA, AC, AG, AT, CC, CA, ...
in a sequence like this:
$sequence = 'AACGTACTGACGTACTGGTTGGTACGA'
PS: This can be done by hand with a regular expression, e.g. $GC = ($sequence =~ s/GC/GC/g), which returns the number of GC occurrences in the sequence.
I need an automated and generic way.
Answers (3)
You had me confused for a while, but I take it you want to count the dinucleotides in a given string.
Code:
Output from Data::Dumper:
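As a rough illustration of counting dinucleotides and dumping the result with Data::Dumper, here is a minimal sketch. It is not the answerer's original code; the lookahead pattern and the `%count` hash are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Data::Dumper;

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

# Tally every two-letter window. The lookahead (?=(...)) captures a pair
# without consuming it, so overlapping pairs are counted as well.
my %count;
$count{$1}++ while $sequence =~ /(?=([ACGT]{2}))/g;

# Sort the keys so the dump is stable between runs.
$Data::Dumper::Sortkeys = 1;
print Dumper(\%count);
```

Only pairs that actually occur get a key; dinucleotides absent from the sequence (GC, for instance) are simply missing from the hash.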
Close to TLP's answer, but without substitution:
Benchmark:
Output:
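A comparison of this kind could be set up along the following lines. This is only a sketch under assumptions: the helper names `count_subst` and `count_match` are made up for the example, and the timings depend on the machine, so no figures are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

# All 16 dinucleotides over the DNA alphabet.
my @bases = qw(A C G T);
my @pairs = map { my $x = $_; map { "$x$_" } @bases } @bases;

# Count via substitution, as in the question. s///g returns the number of
# replacements (or an empty string for none), hence the "|| 0".
sub count_subst {
    my ($seq) = @_;    # local copy, since s/// modifies its target
    my %count;
    for my $pair (@pairs) {
        $count{$pair} = ($seq =~ s/$pair/$pair/g) || 0;
    }
    return \%count;
}

# Count via a plain match in list context; "= () =" yields the match count.
sub count_match {
    my ($seq) = @_;
    my %count;
    for my $pair (@pairs) {
        $count{$pair} = () = $seq =~ /$pair/g;
    }
    return \%count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    subst => sub { count_subst($sequence) },
    match => sub { count_match($sequence) },
});
```

Both variants count non-overlapping matches only, so they agree with each other but would undercount a pair such as AA inside a longer run of A's.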
Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.
(As of this posting, the regex solution marked as accepted fails to count dinucleotides correctly in repeated regions such as 'AAAA...', which are common in naturally occurring sequences.
Once 'AA' matches, the regex search resumes at the third character and skips the 'AA' that starts at the second one. Dinucleotides made of two different bases are unaffected: an 'AC' at one position cannot be followed by another 'AC' starting at the next base. The sequence given in the question does not trigger the problem, since no base appears three times in a row.)
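The skipping behaviour is easy to reproduce on a short run of A's. The snippet below is a small illustration, not taken from any of the answers; the lookahead variant is shown only for contrast.

```perl
use strict;
use warnings;

my $repeat = 'AAAA';

# Plain global match: each match consumes two characters,
# so only the pairs starting at positions 0 and 2 are found.
my $plain = () = $repeat =~ /AA/g;

# Lookahead match: nothing is consumed, so the pairs starting
# at positions 0, 1 and 2 are all found.
my $overlapping = () = $repeat =~ /(?=AA)/g;

print "plain: $plain, overlapping: $overlapping\n";   # plain: 2, overlapping: 3
```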
The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.
Output:
For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.
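A sliding-window count with substr might look roughly like this. It is a sketch rather than the answerer's exact code, and the function name `count_words` and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Count every overlapping word of length $k by sliding a window over the
# sequence with substr. Returns a hash reference mapping word => count.
sub count_words {
    my ($seq, $k) = @_;
    my %count;
    # The last window starts at length - k; the range is empty if $seq is shorter than $k.
    $count{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    return \%count;
}

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my $di  = count_words($sequence, 2);   # dinucleotides
my $tri = count_words($sequence, 3);   # same function, longer words

printf "%s %d\n", $_, $di->{$_} for sort keys %$di;
```

Because every start position is visited, repeated regions are counted correctly, and changing the word length is just a matter of passing a different `$k`.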