Counting the correct number of characters using awk gsub

Posted on 01-23 06:54

I'm trying to count the occurrences of a specific character pattern in a sequence (FASTA format). In my case I want to count how often the context "CC" is present in a sequence. The whole script is working fine, but I ran into one small problem.

To count the "CC" context I use the following part of my script:

CC=gsub(/CC/,"CC");
print CC

The problem appears when I have a FASTA sequence like this:

>name_sequence_1
CCCCC 

In this case, the number of CC should be 4 (positions 1-2, 2-3, 3-4, and 4-5), but gsub gives me the number 2, because after substituting the first CC it continues from the 3rd C, so overlapping matches are never counted.

Is there any way I can fix that using gsub, or is there other code I can use to count such contexts?

Thanks!

Comments (1)

递刀给你 2025-01-30 06:54:01

This MAY be what you're trying to do, assuming the expected output you stated is wrong:

$ echo 'CCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        str = $0
        cnt = 0
        while ( sub(/CC/,"C",str) ) {
            cnt++
        }
        print cnt
    }'
6

but here's a more robust general solution that'll work even when the target string isn't a repetition of 1 character and/or it contains regexp or backreference metachars:

$ echo 'CCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
4

$ echo 'CCCACCCCC' |
    awk '{
        cnt = 0
        for ( i=1; i<length($0); i++ ) {
            cnt += ( substr($0,i,2) == "CC" )
        }
        print cnt
    }'
6
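
If you need this for targets other than "CC", the substr loop above can be wrapped so the target string is a parameter. This is only a sketch under a few assumptions that are not in the question: the function name count_overlaps is made up for illustration, each sequence is assumed to sit on a single line (matches spanning a line break are not counted), and lines starting with ">" are treated as FASTA headers and skipped.

```shell
# count_overlaps TARGET [FILE...]
# Hypothetical helper (name and interface are assumptions, not from the answer above).
# For every non-header line it prints the number of occurrences of the literal
# string TARGET, counting overlapping occurrences as well.
# Caveat: awk -v interprets backslash escapes in the value, so this sketch
# assumes TARGET contains no backslashes (fine for nucleotide strings).
count_overlaps() {
    target=$1
    shift
    awk -v tgt="$target" '
        /^>/ { next }                      # assumed FASTA header line: skip it
        {
            cnt = 0
            len = length(tgt)
            # slide a window of length(tgt) over the line, one position at a time
            for ( i = 1; i <= length($0) - len + 1; i++ ) {
                cnt += ( substr($0, i, len) == tgt )
            }
            print cnt
        }
    ' "$@"
}

printf '>name_sequence_1\nCCCCC\n' | count_overlaps CC
printf 'ACACA\n' | count_overlaps ACA
```

The first call prints 4 and the second prints 2 (ACA at positions 1-3 and 3-5). Because the comparison is a plain string comparison on substr output, regexp metacharacters in the target have no special meaning.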