Print characters before a matching pattern

Posted on 2024-10-18 06:31:11

Following is a combination of awk commands

awk '
    {if ($0~/>/) {head=$0;getline}
    {if($0~/pattern/) print head"\n"$0}}' filename1 | 
awk 'BEGIN  {pos=0;char=0}
    {if($0~/>/) head=$0;getline}
    {pos=0; 
     if($0~/pattern/)
       {pos=match($0,/pattern/);char=substr($0,pos,55)} 
     print head"\n"char}'

The above works great when I want to capture 55 characters starting at the matched pattern "AATTGGCC". The problem is: how can I get the 55 characters that precede the match (a 55-character prefix to the matching pattern)?
Yes, I could write the whole thing in Perl, but since I already have the above in awk, I was wondering whether I can modify it somehow.

Thanks

Comments (4)

无人问我粥可暖 2024-10-25 06:31:11

It is a bit brute force, but you could use a pattern that has 55 periods before AATTGGCC.

For instance:

match($0, /.......................................................AATTGGCC/) {print substr($0, RSTART, 55)}

should do the trick. It would be better to see if awk regular expressions support subexpressions.

Best of all would be to use Python with a library like pygep, since Python is used a lot in bioinformatics.

梦初启 2024-10-25 06:31:11

Here is a demo of a way to print some characters that precede a pattern:

echo 'abcdefghijklmnopqrstuvwxyz' |
    awk 'BEGIN {pat = "jkl"; n = 5}
        (i = index($0, pat)) {
            # i is the 1-based position of pat; lines without pat are skipped
            print substr($0, i - n, n + length(pat))
        }'

Output (the five characters before "jkl", followed by "jkl"):

efghijkl

If your data has newlines and the character sequence you want to output spans a line break, you'll need to accumulate the lines, remove the newlines, and keep enough characters on hand in a buffer variable so that you can output them.
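The accumulation idea can be sketched roughly as follows. This is only an illustration under assumptions: the input is FASTA-like (a `>` header line followed by wrapped sequence lines), the sample data, the stand-in pattern "GGCC", the variable names and the `report` helper are all made up, and for simplicity it holds each whole record in memory instead of a rolling buffer. It reports only the first match per record.

```shell
# Hypothetical input: the sequence of seq1 is wrapped over three lines and
# the pattern "GGCC" straddles the first line break.
out=$(printf '>seq1\nAAAAACCCCCG\nGCCTTTT\nACGT\n>seq2\nTTTT\n' |
awk -v pat="GGCC" -v n=5 '
    # Print the header and up to n characters preceding the first match in seq.
    function report(    pos, start) {
        pos = index(seq, pat)              # literal search; 0 when absent
        if (pos == 0) return
        start = (pos > n) ? pos - n : 1    # clip when the match is near the start
        print head "\n" substr(seq, start, pos - start)
    }
    /^>/ { if (head != "") report(); head = $0; seq = ""; next }
    { seq = seq $0 }                       # join wrapped lines, newlines dropped
    END { if (head != "") report() }
')
printf '%s\n' "$out"
```

Only seq1 is printed, with the five characters in front of the match; seq2 contains no match and is skipped. For very long records a rolling buffer of the last n plus length(pat) characters would use less memory, at the cost of more bookkeeping.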

For what it's worth, here is a simplified version of your script. It may not function correctly, but it's more readable and more AWKish. I haven't done anything to it to try to make it perform your required function nor have I tested it.

awk '

    />/ {head = $0; getline}

    /pattern/ {print head "\n" $0}

    ' filename1 |
awk '

    BEGIN  {pos = 0; char = 0}

    />/ {head = $0; getline}
    {
        pos = 0;
        if ($0 ~ /pattern/) {
            pos = match($0, /pattern/); char = substr($0, pos, 55)
        }
        print head "\n" char
    }'

北风几吹夏 2024-10-25 06:31:11

Without having some sample input, it's hard to test, but I believe your very C-like awk can be reduced to this:

awk -v pattern="abcd_or_whatever" -v n=55 '
    />/ {head=$0; next}
    pos = match($0, pattern) {print head "\n" substr($0, pos, n)} 
'

To get the 55 characters before the match instead, you just have to change the substr arguments to substr($0, pos-n, n).
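One case worth checking is a match that sits closer than n characters to the start of the line. The sketch below makes the clipping explicit instead of relying on how substr treats a start position below 1. The two-record sample input, the short stand-in pattern "GGCC" and n=5 are assumptions chosen to keep the output easy to read.

```shell
# Hypothetical two-record input; "GGCC" stands in for the real pattern.
out=$(printf '>seq1\nAAAAACCCCCGGCCTTTT\n>seq2\nACGGCCTT\n' |
awk -v pattern="GGCC" -v n=5 '
    /^>/ { head = $0; next }                # remember the header line
    (pos = match($0, pattern)) {            # pos is 0 (false) when nothing matches
        start = pos - n
        if (start < 1) start = 1            # fewer than n characters precede the match
        print head "\n" substr($0, start, pos - start)
    }
')
printf '%s\n' "$out"
```

For seq1 this prints the five characters in front of the match (CCCCC); for seq2 only two characters precede the match, so the output is the shorter prefix AC.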

我只土不豪 2024-10-25 06:31:11

Thanks all for your suggestions.
Regarding the format of the awk code: I was not running it as a proper script. It was all typed on the command line, hence so much piping of the output. But I do understand, and I will try to format the code properly whenever I ask for help.

I found that awk sets the variable RSTART to the position where the most recent match() call matched, so I was able to use it as follows (this is only part of the actual command).

awk 'BEGIN{pos=0;char=0}{if($0~/>/) head=$0;getline} {pos=0;if($0~/pattern/) {match($0,/pattern/);char=substr($0,RSTART-47,47)}print head"\n"char}'

This goes back 47 characters from the start of the match and prints them.
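As a small illustration of RSTART together with its companion RLENGTH (the length of the matched text), the following sketch prints the characters on both sides of a match. The sample line, the stand-in pattern "GGCC", n=5 and the variable names are assumptions for the demo, not part of the command above.

```shell
out=$(printf 'AAAAACCCCCGGCCTTTTTAAAAA\n' |
awk -v pattern="GGCC" -v n=5 '
    match($0, pattern) {
        # RSTART: 1-based index where the match begins; RLENGTH: its length.
        start = (RSTART > n) ? RSTART - n : 1
        before = substr($0, start, RSTART - start)
        after  = substr($0, RSTART + RLENGTH, n)
        print before " | " substr($0, RSTART, RLENGTH) " | " after
    }
')
printf '%s\n' "$out"
```

On the sample line this prints `CCCCC | GGCC | TTTTT`: the n characters before the match, the match itself, and the n characters after it.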
