Parsing a text file with grep/sed/awk to print multiple lines after a pattern match, then joining them into one line
I want to use grep/awk/sed to parse a text file containing several kinds of descriptions for many genes. I would like each row to represent one gene description.
Right now I want to extract the Automated and Concise descriptions into separate txt files, with each row holding the description for a single gene.
Download file
wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
I have been able to extract the desired text into individual text files using the code below. However, I am unable to output the text as single rows.
awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt
#do this for the next section automated description
awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt
#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt
Can someone help?
Current text structure for 1 gene description
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Desired text structure for 1 gene description
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
Thank you, Jose.
Comments (2)
A few small changes to OP's current awk code:
NOTE: I'm not sure I understand OP's current use of grep -v, since we don't have a sample set of input that demonstrates the need for the grep -v ... ?
For the small sample provided this generates:
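A minimal sketch of the kind of small change described here, and what it produces (an illustration only, not necessarily the answerer's exact code). The end pattern is tested before anything is printed, so the Automated description line never reaches the output and grep -v is no longer needed, and the wrapped lines are collected in a buffer that is printed as one row. The file name sample_descriptions.txt and its two placeholder lines are hypothetical; only the Concise text is copied from the question, and anchoring the labels with ^ assumes they always start a line.

```shell
# Hypothetical stand-in for the real file: the Concise text is copied from the
# question, the last two lines are placeholders.
cat > sample_descriptions.txt <<'EOF'
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Automated description: placeholder automated text
Gene class description: placeholder gene class text
EOF

awk '
/^Automated description/ {            # end marker is tested first, so it is never printed
    if (flag) print buf               # emit the collected description as one row
    flag = 0; buf = ""
}
/^Concise description:/ { flag = 1 }  # start collecting at the section label
flag { buf = (buf == "" ? $0 : buf " " $0) }   # join wrapped lines with single spaces
END { if (flag) print buf }           # in case the file ends inside a description
' sample_descriptions.txt > WB283_concise.txt

cat WB283_concise.txt
```

On this sample the output should be the single "Concise description: ..." row shown in the question as the desired structure. For the Automated descriptions, the same script can be run with the start pattern changed to /^Automated description:/ and the end pattern to /^Gene class description/.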
Assumptions:
Concise or Automated text in the input file, and all input is to be routed to one of two output files
We could consolidate OP's current 2x awk scripts into one, e.g.:
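One way such a consolidated script could look, as a hedged sketch rather than the answerer's original code: a single awk pass buffers each Concise or Automated block and writes it as one row to its own output file. The sample file, its placeholder text, and the record layout (a header line starting with WBGene and a "=" separator line) are assumptions made for illustration; the output file names are the ones from the question.

```shell
# Hypothetical two-record sample; all text is placeholder, and the "WBGene..."
# header and "=" separator lines are an assumed record layout.
cat > sample_descriptions.txt <<'EOF'
WBGene99999901 gene-1 SEQ.1
Concise description: first concise line
second concise line
Automated description: first automated line
second automated line
Gene class description: placeholder
=
WBGene99999902 gene-2 SEQ.2
Concise description: only concise line
Automated description: only automated line
Gene class description: placeholder
=
EOF

awk '
function emit() {                         # write the buffered description as one row
    if (out != "") print buf > (out)
    out = ""; buf = ""
}
/^Concise description:/   { emit(); out = "WB283_concise.txt";   buf = $0; next }
/^Automated description:/ { emit(); out = "WB283_automated.txt"; buf = $0; next }
# Any other "... description:" label, a "=" line, or a gene header ends the block.
/^[A-Za-z ]+ description:/ || /^=/ || /^WBGene/ { emit(); next }
out != "" { buf = buf " " $0 }            # continuation line: append with one space
END { emit() }
' sample_descriptions.txt

cat WB283_concise.txt WB283_automated.txt
```

Because both outputs come from one pass, the large input file is read only once. To try it on the real data, decompress the downloaded .gz file and pass the resulting .txt file instead of sample_descriptions.txt.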
May I suggest a slightly modified solution (not exactly what is asked for, but with potentially useful thoughts):
The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The name of the gene is printed in the first column, as many concise descriptions don't include the name.
Output format is a single line for each gene, starting with its name (+ colon), followed by the "pure" concise description.
By the way: if you want to create a second output, with the "Automated description" in each line, change the second awk-line from /Concise description/ to /Automated description/
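A minimal sketch of this idea, under stated assumptions and not the answerer's original script: each record is assumed to start with a header line beginning with WBGene whose second whitespace-separated field is the gene name, and to end at a "=" line. The sample file and its placeholder text are hypothetical. Here the section to extract is passed in the awk variable label (a name introduced for this sketch), which plays the role of the pattern the answer suggests changing.

```shell
# Hypothetical two-record sample; the header layout (ID, gene name, sequence
# name) and the "=" separator are assumptions, and all text is placeholder.
cat > sample_descriptions.txt <<'EOF'
WBGene99999901 gene-1 SEQ.1
Concise description: first concise line
second concise line
Automated description: first automated line
Gene class description: placeholder
=
WBGene99999902 gene-2 SEQ.2
Concise description: only concise line
Automated description: only automated line
Gene class description: placeholder
=
EOF

awk -v label="Concise description" '
function emit() {                          # print "name: description" for a finished block
    if (buf != "") print name ": " buf
    buf = ""; inside = 0
}
/^WBGene/ { emit(); name = $2; next }      # header line: remember the gene name
index($0, label ":") == 1 {                # the wanted section starts here
    emit(); inside = 1
    buf = substr($0, length(label) + 2)    # drop the label and its colon
    sub(/^[ \t]+/, "", buf)                # and the whitespace after the colon
    next
}
/^[A-Za-z ]+ description:/ || /^=/ { emit(); next }   # another section or record end
inside { buf = (buf == "" ? $0 : buf " " $0) }        # join wrapped lines
END { emit() }
' sample_descriptions.txt > concise_by_gene.txt

cat concise_by_gene.txt
```

Running the same command with -v label="Automated description" and a different output file name gives the second output mentioned above.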