grep/sed/awk: parse a text file to print multiple lines after a pattern match, then convert them to a single line

Posted on 2025-02-01 19:11:17

I want to use grep/awk/sed to parse a text file that contains several kinds of description for each of many genes. I would like each row to represent one gene description.

Right now I want to extract the Automated and Concise descriptions into separate txt files, with each row holding the description for a single gene.

Download file

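wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
gunzip c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz   # assumed step: the commands below read the uncompressed .txt file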

I have been able to extract the desired text into individual text files using the code below. However, I am unable to get each description onto a single row.

awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt

# do the same for the next section, the Automated description

awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt

# I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt

Can someone help?

Current text structure for one gene description:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions.

Desired text structure for one gene description:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.

Thank you, Jose.

高冷爸爸 2025-02-08 19:11:17

A few small changes to OP's current awk code:

awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }                # close out current printf line of output
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' file

NOTE: I'm not sure I understand OP's current use of grep -v, since we don't have a sample set of input that demonstrates the need for it.

For the small sample provided this generates:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide  3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.
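The doubled spaces in that output come from the trailing blank at the end of each wrapped input line, to which the single-space prefix is then added. A minimal sketch, assuming the trailing characters are ordinary spaces or tabs, strips them before joining. The file sample.txt and its second (made-up) gene are only for illustration, and the first entry is abridged from the sample in the question:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical miniature input; the wrapped lines end in a trailing space.
printf '%s\n' \
  'Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide ' \
  '3-kinase (PI3K) p50/p55 adaptor/regulatory subunit.' \
  'Automated description: placeholder text.' \
  'Concise description: xyz-1 is a made-up entry ' \
  'wrapped over two lines.' \
  'Automated description: placeholder text.' > sample.txt

joined=$(awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { if (flag) print ""; flag=0 }   # end the joined line only if one is open
flag                    { sub(/[ \t]+$/, "")              # drop trailing blanks from this input line
                          printf "%s%s", pfx, $0; pfx=" " }
' sample.txt)

printf '%s\n' "$joined"
```

Each Concise block should now come out as one row with single spaces at the former line breaks.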

Assumptions:

OP needs to parse the input file twice (for two different blocks of text);
the two different blocks of text do not overlap;
there may be multiple blocks of Concise or Automated text in the input file, and all input is to be routed to one of two output files.

We could consolidate OP's current 2x awk scripts into one, eg:

awk '
function close_line()    { if (outfile) print "" > outfile }      # close out prior printf line of output?

/Concise description:/   { close_line()
                           outfile="WB283_concise.txt"
                           pfx=""
                         }
/Automated description:/ { close_line()
                           outfile="WB283_automated.txt"
                           pfx=""
                         }
/Gene class description/ { close_line()
                           outfile=""
                         }
outfile                  { printf "%s%s", pfx, $0 > outfile
                           pfx=" "
                         }
END                      { close_line() }
' file
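One way to check the result is to compare the number of marker lines in the input with the number of rows in each output file; if the assumptions above hold, they should match. The following is only a sketch on a made-up two-gene input (sample.txt is hypothetical, not real WormBase data), using the same logic as the consolidated script in compact form:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical miniature input with two genes.
printf '%s\n' \
  'Concise description: gene one, ' \
  'second line.' \
  'Automated description: auto one, ' \
  'continued.' \
  'Gene class description: class one' \
  'Concise description: gene two.' \
  'Automated description: auto two.' \
  'Gene class description: class two' > sample.txt

awk '
function close_line() { if (outfile) print "" > outfile }
/Concise description:/   { close_line(); outfile="WB283_concise.txt";   pfx="" }
/Automated description:/ { close_line(); outfile="WB283_automated.txt"; pfx="" }
/Gene class description/ { close_line(); outfile="" }
outfile                  { printf "%s%s", pfx, $0 > outfile; pfx=" " }
END                      { close_line() }
' sample.txt

# Each output file should have one row per marker line in the input.
markers_concise=$(grep -c 'Concise description:' sample.txt)
rows_concise=$(wc -l < WB283_concise.txt | tr -d ' ')
markers_automated=$(grep -c 'Automated description:' sample.txt)
rows_automated=$(wc -l < WB283_automated.txt | tr -d ' ')
echo "concise:   $markers_concise markers, $rows_concise rows"
echo "automated: $markers_automated markers, $rows_automated rows"
```

A mismatch on the real file would suggest that one of the assumptions does not hold, for example a gene entry with no Gene class description line to end its Automated block.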

〃安静 2025-02-08 19:11:17

May I suggest a slightly modified solution (not exactly what is asked for, but with potentially useful thoughts):

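awk '
/WBGene/              { printf("\n%s: ", $2) }   # gene header line: start a new output line with the second field
/Concise description/ { flag = 1; $1=$2="" }     # start collecting; blank out the two label fields
/=/                   { flag = 0 }               # any line containing "=" stops collecting
/^.* description/     { flag = 0 }               # so does any line that still contains " description"
flag                  { printf " %s", $0 }       # append the collected line to the current output line
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt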

The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The name of the gene is printed in the first column, because many concise descriptions do not include the name.

Output format is a single line for each gene, starting with its name (+ colon), followed by the "pure" concise description.

By the way: if you want to create a second output with the "Automated description" on each line, change the pattern on the second line of the awk script from /Concise description/ to /Automated description/.
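Instead of editing the pattern by hand, the section label can be passed in with awk -v. This is only a sketch under assumptions that are not confirmed by the thread: gene header lines start with "WBGene", a line starting with "=" separates genes, and every section label has the form "<words> description:" at the start of a line. The shell function name extract and the sample data are made up for illustration:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical miniature input in the assumed layout.
printf '%s\n' \
  'WBGene99999991 abc-1 X1.1' \
  'Concise description: abc-1 does one thing ' \
  'over two lines.' \
  'Automated description: Predicted to do another thing.' \
  '=' \
  'WBGene99999992 abc-2 X2.2' \
  'Concise description: abc-2 is short.' \
  'Automated description: Predicted to be wrapped ' \
  'as well.' \
  '=' > sample.txt

# usage: extract LABEL FILE   (LABEL is e.g. Concise or Automated)
extract() {
  awk -v label="$1" '
  /^WBGene/ { name = $2; next }                                 # remember the gene name
  /^=/      { if (collecting) print ""; collecting = 0; next }  # gene separator ends any open row
  /^[A-Za-z ]+ description:/ {                                  # any section label line
      if (collecting) print ""                                  # finish the previous joined row
      collecting = (index($0, label " description:") == 1)      # is this the wanted section?
      if (collecting) {
          sub(/^[^:]*: */, "")                                  # drop the label itself
          sub(/[ \t]+$/, "")                                    # drop trailing blanks
          printf "%s: %s", name, $0
      }
      next
  }
  collecting { sub(/[ \t]+$/, ""); printf " %s", $0 }           # continuation line
  END        { if (collecting) print "" }
  ' "$2"
}

concise=$(extract Concise sample.txt)
automated=$(extract Automated sample.txt)
printf '%s\n' "$concise" "$automated"
```

Anchoring the patterns at the start of the line is a deliberate choice here, so that a "=" or the word "description" inside the running text does not end a block early. A wrapped line that happens to start with words followed by " description:" would still be misread as a label.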
