Print characters before a matching pattern
The following is a combination of awk commands:
awk '
{if ($0~/>/) {head=$0;getline}
{if($0~/pattern/) print head"\n"$0}}' filename1 |
awk 'BEGIN {pos=0;char=0}
{if($0~/>/) head=$0;getline}
{pos=0;
if($0~/pattern/)
{pos=match($0,/pattern/);char=substr($0,pos,55)}
print head"\n"char}'
The above works well: it captures 55 characters starting at the matched pattern "AATTGGCC". The problem is how to get the 55 characters immediately before the matching pattern instead.
Yes, I could write the whole thing in Perl, but since I already have the above in awk, I was wondering whether I can modify it somehow.
Thanks
Comments (4)
It is a bit brute force, but you could use a pattern that has 55 periods before AATTGGCC.
For instance:
match($0, /.......................................................AATTGGCC/) {print substr($0,RSTART,55)}
should do the trick. It would be better to check whether awk regular expressions support subexpressions.
But the best option would be to use Python and a library like pygep, because Python is used a lot in bioinformatics.
Here is a demo of a way to print some characters that precede a pattern:
Output (five characters before "jkl" and "jkl"):
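A minimal sketch of such a demo, not the answerer's original code. The helper name before_match and the sample input are made up for illustration; the only awk features assumed are match(), RSTART, RLENGTH and substr().

```shell
# Hypothetical helper: print up to n characters before the first match
# of a regex, followed by the match itself.
before_match() {
    # $1 = regex, $2 = number of preceding characters to keep
    awk -v pat="$1" -v n="$2" '
        match($0, pat) {
            start = RSTART - n          # where the context would begin
            if (start < 1) start = 1    # clamp when fewer than n characters precede the match
            print substr($0, start, RSTART - start + RLENGTH)
        }'
}

echo "abcdefghijklmnopqrstuvwxyz" | before_match "jkl" 5
# prints: efghijkl
```

Clamping the start keeps the output predictable when the match sits near the beginning of the line, rather than relying on how a particular awk handles a start position below 1.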
If your data has newlines and the character sequence you want to output spans newlines, you'll need to accumulate the lines, remove the newlines, and keep enough characters on hand in a buffer variable so you can output them.
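One way to sketch that accumulation, assuming FASTA-like input in which each record starts with a ">" header line and the sequence may be wrapped over several lines. The helper name prefix_across_lines is hypothetical, and for simplicity it buffers the whole record instead of a sliding window.

```shell
# Hypothetical helper: join the wrapped sequence lines under each ">" header,
# then print the header and the n characters that precede the pattern.
prefix_across_lines() {
    # $1 = pattern, $2 = n; reads FASTA-like text on stdin
    awk -v pat="$1" -v n="$2" '
        function report() {
            # only report when a full n-character prefix exists
            if (head != "" && match(seq, pat) && RSTART > n)
                print head "\n" substr(seq, RSTART - n, n)
        }
        /^>/ { report(); head = $0; seq = ""; next }   # new record: flush the previous one
             { seq = seq $0 }                          # append the line; awk has already dropped the newline
        END  { report() }'
}

printf '>s1\nACGT\nACGG\nTTAA\n' | prefix_across_lines "GGTT" 3
# prints:
# >s1
# TAC
```

In the sample, the pattern GGTT straddles the second and third sequence lines, so it is only found once the lines have been joined.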
For what it's worth, here is a simplified version of your script. It may not function correctly, but it's more readable and more AWKish. I haven't done anything to it to try to make it perform your required function nor have I tested it.
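As a rough illustration of what a more awk-style version could look like (a sketch only, not the answerer's script), the two-stage pipeline can be written as pattern-action rules in a single pass. It assumes each ">" header is followed by sequence lines on which the pattern is searched line by line; the function name first_55_from_match is made up here.

```shell
# Hypothetical helper: one awk pass in place of the two-stage pipeline.
first_55_from_match() {
    # $1 = pattern; input: ">" header lines followed by sequence lines
    awk -v pat="$1" '
        />/ { head = $0; next }                     # remember the most recent header
        match($0, pat) {                            # sequence line containing the pattern
            print head "\n" substr($0, RSTART, 55)  # 55 characters starting at the match
        }'
}

printf '>id1\nGGAATTGGCCTT\n>id2\nCCCC\n' | first_55_from_match "AATTGGCC"
# prints:
# >id1
# AATTGGCCTT
```

Like the original, this prints the text starting at the match, not the text before it.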
Without having some sample input, it's hard to test, but I believe your very C-like awk can be reduced to this:
and to get the 55 chars before the match, you just have to change the substr arguments to
substr($0, pos-n, n)
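A small sketch of that substr call in context, assuming pos comes from match() and that lines with fewer than n characters before the match should be skipped. The helper name chars_before is hypothetical.

```shell
# Hypothetical helper: print the n characters immediately before the first match.
chars_before() {
    # $1 = pattern, $2 = n
    awk -v pat="$1" -v n="$2" '
        {
            pos = match($0, pat)              # 0 when there is no match
            if (pos > n)                      # a full n-character prefix exists
                print substr($0, pos - n, n)
        }'
}

echo "0123456789AATTGGCC" | chars_before "AATTGGCC" 4
# prints: 6789
```

The pos > n guard matters because pos-n drops below 1 when the match is close to the start of the line, and it is safer not to depend on what substr returns in that case.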
Thanks, all, for your suggestions.
Regarding the format of the awk code: I was not running it as a proper script. It was all typed on the command line, hence so much piping of output. But I do understand, and I will try to format code properly whenever I ask for help.
I found that RSTART is a variable awk sets to the position where match() found the pattern, so I was able to use it as follows (this is only part of the actual command).
This goes back 47 characters from the start of the match and prints them.
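For readers unfamiliar with RSTART, here is a short sketch of what match() records; it is not the command from the post. The helper name show_match and the sample line are invented, and the post's value of 47 would be passed as the second argument.

```shell
# Hypothetical helper: report RSTART, RLENGTH and up to n characters before the match.
show_match() {
    # $1 = pattern, $2 = n
    awk -v pat="$1" -v n="$2" '
        match($0, pat) {
            # RSTART = 1-based index of the match, RLENGTH = its length
            start = (RSTART > n) ? RSTART - n : 1
            printf "RSTART=%d RLENGTH=%d before=%s\n", RSTART, RLENGTH, substr($0, start, RSTART - start)
        }'
}

echo "abcdAATTGGCCtt" | show_match "AATTGGCC" 3
# prints: RSTART=5 RLENGTH=8 before=bcd
```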