Counting the correct number of characters using awk gsub
I'm trying to count the occurrences of a specific character pattern in a sequence (FASTA format). In my case, I want to count how often the context "CC" occurs in a sequence. The whole script works fine, but I ran into one small problem.
For calculating the "CC" context I use the following part of my script:
CC=gsub(/CC/,"CC");
print CC
I experience a problem when I have a fasta sequence like this:
>name_sequence_1
CCCCC
In this case, the number of CC should be 4 (positions 1-2, 2-3, 3-4, and 4-5), but gsub gives me the number 2, because after substituting the first CC, it jumps to the 3rd C and so on.
Is there any way to fix that using gsub, or is there other code I can use to count such contexts?
Thanks!

This MAY be what you're trying to do, assuming the expected output you stated is wrong:
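As a minimal sketch of one possibility (not necessarily the snippet originally posted here), the special case where the target is a repetition of a single character can be handled by counting runs: a run of k consecutive Cs contains k-1 overlapping "CC" pairs. The function name `count_cc_runs` is made up for illustration, and FASTA header lines are assumed to start with `>`.

```shell
# Hypothetical helper: prints the number of overlapping "CC" pairs in one sequence line.
count_cc_runs() {
  printf '%s\n' "$1" | awk '
    /^>/ { next }                      # skip FASTA header lines
    {
      n = 0
      s = $0
      # Each run of k consecutive Cs contributes k-1 overlapping "CC" pairs.
      while (match(s, /C+/)) {
        n += RLENGTH - 1
        s = substr(s, RSTART + RLENGTH)   # continue after the current run
      }
      print n
    }'
}
```

With this, `count_cc_runs CCCCC` prints 4 rather than the 2 that gsub reports.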
but here's a more robust general solution that'll work even when the target string isn't a repetition of a single character and/or it contains regexp or backreference metacharacters:
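A sketch of what such a general approach could look like, under my own assumptions: treat the target as a literal string with `index()` instead of a regexp, and after each hit restart the search one character past the start of the match so that overlapping occurrences are counted. The name `count_overlapping` is hypothetical, and the target is passed through the environment (`ENVIRON`) rather than `-v` so that backslashes in it are not interpreted as escape sequences.

```shell
# Hypothetical helper: count_overlapping SEQUENCE TARGET
# Prints how many (possibly overlapping) times the literal TARGET occurs in SEQUENCE.
count_overlapping() {
  printf '%s\n' "$1" | tgt="$2" awk '
    BEGIN { tgt = ENVIRON["tgt"] }     # literal target, no regexp or escape processing
    /^>/  { next }                     # skip FASTA header lines
    {
      n = 0
      s = $0
      if (tgt != "") {
        # index() does a plain string search, so metacharacters are harmless.
        while ((i = index(s, tgt)) > 0) {
          n++
          s = substr(s, i + 1)         # step one character past the match start
        }
      }
      print n
    }'
}
```

For example, `count_overlapping CCCCC CC` prints 4, and `count_overlapping ABABA ABA` prints 2. Because no substitution is performed, characters such as `.` or `&` in the target need no escaping.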