Counting the correct number of characters with awk gsub

Posted 01-23 06:54 · 385 words · 2 views · 0 comments

I'm trying to count the occurrences of a specific character pattern in a sequence (FASTA format). In my case, I want to count how often the context "CC" occurs in a sequence. The whole script is working fine, but I ran into one small problem.

For calculating the "CC" context I use the following part of my script:

CC=gsub(/CC/,"CC");
print CC

I run into a problem when I have a FASTA sequence like this:

>name_sequence_1
CCCCC

In this case, the number of CC should be 4 (positions 1-2, 2-3, 3-4, and 4-5), but gsub gives me the number 2, because after substituting the first CC, it jumps to the 3rd C and so on.

Is there any way to fix that using gsub, or is there other code I can use to count such contexts?

Thanks!

递刀给你 2025-01-30 06:54:01

This MAY be what you're trying to do, assuming the expected output you stated is wrong:

$ echo 'CCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
6

but here's a more robust general solution that'll work even when the target string isn't a repetition of a single character and/or it contains regexp or backreference metacharacters:

$ echo 'CCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
6
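
If it helps, here is a rough sketch of how the same idea could be wrapped up for a whole FASTA file. It is only an illustration under some assumptions: the names count_overlapping, count and report are made up for this example, the pattern is treated as a literal string (found with index(), not as a regexp), and the lines of a multi-line record are joined before counting so that matches spanning a line break are included. Note that awk -v interprets backslash escapes in the value, which doesn't matter for plain nucleotide patterns like "CC".

```shell
# Hypothetical helper: reads FASTA on stdin and prints "name<TAB>count"
# for each record, counting overlapping occurrences of the literal pattern $1.
count_overlapping() {
    awk -v pat="$1" '
        # Count overlapping occurrences of the literal string pat in seq.
        function count(seq, pat,    n, pos, off) {
            if ( pat == "" ) return 0          # avoid looping on an empty pattern
            n = 0
            pos = 1
            while ( (off = index(substr(seq, pos), pat)) > 0 ) {
                n++
                pos += off                     # resume one character after the match start
            }
            return n
        }
        # Print the result for the record collected so far, if any.
        function report() {
            if ( have ) print name "\t" count(seq, pat)
        }
        /^>/ { report(); name = substr($0, 2); seq = ""; have = 1; next }
             { seq = seq $0 }                  # join the lines of a multi-line sequence
        END  { report() }
    '
}

printf '>name_sequence_1\nCCCCC\n' | count_overlapping CC
# prints: name_sequence_1<TAB>4
```

Advancing by the match offset rather than by the pattern length is what makes the matches overlap; gsub() instead resumes after the end of each match, which is why it reports 2 for CCCCC.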
