grep/sed/awk: parse a text file into one row per gene, with the descriptions as columns

Posted on 2025-02-01 17:13:35

I want to use grep/awk/sed to parse a text file containing various gene descriptions.

To download the file (it is gzip-compressed, so decompress it with gunzip before running the commands below):

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz

Example text below:

WBGene00000001  aap-1   Y110A7A.10
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions. 
Automated description: Enables protein kinase binding activity. Involved in dauer 
larval development; determination of adult lifespan; and insulin receptor signaling 
pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and 
neurons. Human ortholog(s) of this gene implicated in several diseases, including 
Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency 
36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit 
3). 
Gene class description: phosphoinositide kinase AdAPter subunit 
=
WBGene00000002  aat-1   F27C8.1
Concise description: aat-1 encodes an amino acid transporter catalytic subunit; 
when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1 
is able to facilitate amino acid uptake and exchange, showing a relatively high 
affinity for small and some large neutral amino acids; in addition, AAT-1 is able 
to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus 
expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface 
of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly. 
Automated description: Contributes to L-amino acid transmembrane transporter activity. 
Involved in amino acid transmembrane transport. Located in plasma membrane. Part 
of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons; 
and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric 
protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member 
8). 
Gene class description: Amino Acid Transporter 
=
WBGene00000003  aat-2   F07C3.7
Concise description: aat-2 encodes a predicted amino acid transporter catalytic 
subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however, 
AAT-2 is not able to induce amino acid uptake. 
Automated description: Predicted to enable L-amino acid transmembrane transporter 
activity. Predicted to be involved in L-alpha-amino acid transmembrane transport 
and L-amino acid transport. Predicted to be located in membrane. Predicted to be 
integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria. 
Is an ortholog of human SLC7A8 (solute carrier family 7 member 8). 
Gene class description: Amino Acid Transporter 

This text file contains, for each gene, a name line (e.g. WBGene00000004 aat-3 F52H2.2a) followed by Concise description:, Automated description:, and Gene class description: fields; records are separated by a line containing a single equals sign "=".

I have been trying to parse this txt file, so I figured I would start by extracting each field separately for every gene.
Below is my code:


#genes
grep "WBGene" c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_WBgenes.txt

#gene class description:
awk '/Gene class description:/' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_geneclass.txt

#concise description
awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_concise.txt

#automated description
awk '
/Automated description:/  { flag=1; pfx="" }
/Gene class description:/ { flag=0; print "" }
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_automated.txt


My problem: is there a way to combine my code, or write new code, to better address this?

I would like to extract each gene name, Concise description:, Automated description:, and Gene class description: into separate columns, with each row representing a gene.

I would like to create a txt file in which each row is a gene and each column is one of the descriptions.

Desired output:

WBGene00000001  aap-1   Y110A7A.10      phosphoinositide kinase AdAPter subunit         aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.      Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling  pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and  neurons. Human ortholog(s) of this gene implicated in several diseases, including  Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency  36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit  3).
WBGene00000002  aat-1   F27C8.1 Amino Acid Transporter  aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1  is able to facilitate amino acid uptake and exchange, showing a relatively high  affinity for small and some large neutral amino acids; in addition, AAT-1 is able  to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus  expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface  of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly.     Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part  of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons;  and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric  protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member  8).
WBGene00000003  aat-2   F07C3.7 Amino Acid Transporter  aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,  AAT-2 is not able to induce amino acid uptake.  Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport  and L-amino acid transport. Predicted to be located in membrane. Predicted to be  integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.  Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).

Comments (2)

我做我的改变 2025-02-08 17:13:35

Assuming the output should be tab-delimited, one awk idea:

awk '
BEGIN { OFS="\t" }
function print_output()    { if (baseID) print baseID,gene_name,trans_name,gene_desc,concise_desc,auto_desc; baseID="" }

$1 ~ /WBGene/              { baseID=$1; gene_name=$2; trans_name=$3 }
/^Gene class description:/ { gene_desc    =substr($0, index($0,": ")+2) ; in_block="" }
/^Concise description:/    { concise_desc =substr($0, index($0,": ")+2) ; in_block="concise"; pfx=""; next }
/^Automated description:/  { auto_desc    =substr($0, index($0,": ")+2) ; in_block="auto"   ; pfx=""; next }

in_block                   { if (in_block == "concise")
                                concise_desc = concise_desc pfx $0
                             else
                                auto_desc = auto_desc pfx $0
                             pfx=" "
                           }
$1 == "="                  { print_output() }

END                        { print_output() }
' input.file

For the provided sample this generates:

WBGene00000001  aap-1   Y110A7A.10      phosphoinositide kinase AdAPter subunit         aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.      Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling  pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and  neurons. Human ortholog(s) of this gene implicated in several diseases, including  Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency  36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit  3).
WBGene00000002  aat-1   F27C8.1 Amino Acid Transporter  aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1  is able to facilitate amino acid uptake and exchange, showing a relatively high  affinity for small and some large neutral amino acids; in addition, AAT-1 is able  to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus  expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface  of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly.     Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part  of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons;  and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric  protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member  8).
WBGene00000003  aat-2   F07C3.7 Amino Acid Transporter  aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,  AAT-2 is not able to induce amino acid uptake.  Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport  and L-amino acid transport. Predicted to be located in membrane. Predicted to be  integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.  Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
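
The rows above still contain double spaces wherever an input line ended with a trailing space. If that matters, a small post-processing filter can tidy each field. The following is only a sketch: the function name squeeze_tsv and the file names in the usage comment are hypothetical, and it assumes the rows are already tab-separated.

```shell
# squeeze_tsv: hypothetical helper. Reads tab-separated rows on stdin,
# collapses runs of spaces inside each field and trims a leading or
# trailing space, then writes the rows back out tab-separated.
squeeze_tsv() {
  awk '
    BEGIN { FS = OFS = "\t" }
    {
      for (i = 1; i <= NF; i++) {
        gsub(/ +/, " ", $i)      # collapse runs of spaces to a single space
        gsub(/^ | $/, "", $i)    # drop the remaining leading or trailing space
      }
      print
    }
  '
}

# Usage (file names are placeholders):
#   awk '...' input.file | squeeze_tsv > genes.tsv
```

Keeping this as a separate filter leaves the parsing script unchanged, so the two steps can be checked independently.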
甜尕妞 2025-02-08 17:13:35

I'm not sure whether I understand your question correctly, but to achieve the result in your data frame picture I'd suggest something like:

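awk '
BEGIN                      { COLSEP = "\t" }
# Append one input line to a field, trimming trailing blanks and joining with a single space.
function add(s, line)      { sub(/[ \t]+$/, "", line); return (s == "" ? line : s " " line) }
# Print the buffered gene as one row, then reset the buffers.
function emit()            { if (id != "") print id COLSEP gcd COLSEP cd COLSEP ad; id = gcd = cd = ad = ""; flag = 0 }
/^WBGene/                  { emit(); id = $1 COLSEP $2 COLSEP $3; next }
/^=[ \t]*$/                { emit(); next }
/^Gene class description:/ { flag = 1; sub(/^Gene class description:[ \t]*/, "") }
/^Automated description:/  { flag = 2; sub(/^Automated description:[ \t]*/, "") }
/^Concise description:/    { flag = 3; sub(/^Concise description:[ \t]*/, "") }
flag==1                    { gcd = add(gcd, $0) }
flag==2                    { ad  = add(ad, $0) }
flag==3                    { cd  = add(cd, $0) }
END                        { emit() }   # the last record has no trailing "=" line
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt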
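
Whichever script is used, it may be worth confirming that every row ended up with the six tab-separated fields of the desired output (three identifiers plus three descriptions). A minimal sketch of such a check follows; the function name check_columns and the output file name in the usage comment are hypothetical.

```shell
# check_columns FILE N: hypothetical helper. Prints the line number and field
# count of every row in FILE that does not have exactly N tab-separated
# fields, and returns non-zero if any such row was found.
check_columns() {
  awk -F '\t' -v n="$2" '
    NF != n { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END     { exit bad + 0 }
  ' "$1"
}

# Usage (file name is a placeholder):
#   check_columns genes.tsv 6 && echo "all rows have 6 columns"
```

A row with too few fields usually points to a missing separator between two columns, and a row with too many to a literal tab inside a description.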