如何从 XML NCBI BLAST 文件中提取第一个命中元素?
我试图从 NCBI xml BLAST 文件中仅提取第一个命中。接下来我只想得到第一个HSP。在最后阶段,我想根据最好成绩获得这些。 为了清楚起见,这里提供了 xml 文件的示例:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
<BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
<BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
<BlastOutput_query-len>1024</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>1e-05</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>lcl|1_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1024</Iteration_query-len>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
<Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
<Iteration>
<Iteration_iter-num>6</Iteration_iter-num>
<Iteration_query-ID>lcl|6_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1068</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
<Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
<Hit_accession>23609</Hit_accession>
<Hit_len>293</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>49.2914</Hsp_bit-score>
<Hsp_score>116</Hsp_score>
<Hsp_evalue>5.15408e-06</Hsp_evalue>
<Hsp_query-from>580</Hsp_query-from>
<Hsp_query-to>792</Hsp_query-to>
<Hsp_hit-from>202</Hsp_hit-from>
<Hsp_hit-to>273</Hsp_hit-to>
<Hsp_query-frame>-1</Hsp_query-frame>
<Hsp_identity>26</Hsp_identity>
<Hsp_positive>45</Hsp_positive>
<Hsp_gaps>2</Hsp_gaps>
<Hsp_align-len>73</Hsp_align-len>
<Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
<Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
<Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
<Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
<Hit_accession>2466</Hit_accession>
<Hit_len>3084</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>48.9062</Hsp_bit-score>
<Hsp_score>115</Hsp_score>
<Hsp_evalue>6.70494e-06</Hsp_evalue>
<Hsp_query-from>369</Hsp_query-from>
<Hsp_query-to>875</Hsp_query-to>
<Hsp_hit-from>2312</Hsp_hit-from>
<Hsp_hit-to>2465</Hsp_hit-to>
<Hsp_query-frame>-2</Hsp_query-frame>
<Hsp_identity>52</Hsp_identity>
<Hsp_positive>70</Hsp_positive>
<Hsp_gaps>4</Hsp_gaps>
<Hsp_align-len>173</Hsp_align-len>
<Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
<Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
<Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
基本上每个查询搜索都会创建一个 Iteration 元素。每次迭代可以有多次命中,而多次命中又可以有多个 HSP。我只想获得第一个命中,并且它是每次迭代中的第一个 HSP。如果 BLAST 没有发现命中,我想忽略迭代。 我编写了这个简单的代码:
#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
for hit in iteration.findall('Iteration_hits/Hit'):
It_Num = iteration.findtext('Iteration_iter-num')
It_ID = iteration.findtext('Iteration_query-def')
Hit_Def = hit.findtext('Hit_def')
Num = hit.findtext('Hit_num')
ID = hit.findtext('Hit_id')
DEF = hit.findtext('Hit_def')
ACC = hit.findtext('Hit_accession')
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
for hsp in hit.findall('Hit_hsps'):
HSPN = hsp.findtext('Hsp/Hsp_num')
identities = hsp.findtext('Hsp/Hsp_identity')
#print 'id: ', identities.rjust(4),
length = hsp.findtext('Hsp/Hsp_align-len')
#print 'len:', length.rjust(4),
Per_ID = int(identities) * 100.0 / int(length)
#print hsp.findtext('Hsp/Hsp_qseq')[:50]
#print hsp.findtext('Hsp/Hsp_midline')[:50]
#print hsp.findtext('Hsp/Hsp_hseq')[:50]
save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
save_file.write('n\n' % ())
任何帮助将不胜感激!
Im trying to extract only the first hit from an NCBI xml BLAST file. next I would like to get only the first HSP. at the final stage I would like to get these based on best score.
to make things clear here a sample of the xml file:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
<BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
<BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
<BlastOutput_query-len>1024</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>1e-05</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>lcl|1_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1024</Iteration_query-len>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
<Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
<Iteration>
<Iteration_iter-num>6</Iteration_iter-num>
<Iteration_query-ID>lcl|6_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1068</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
<Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
<Hit_accession>23609</Hit_accession>
<Hit_len>293</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>49.2914</Hsp_bit-score>
<Hsp_score>116</Hsp_score>
<Hsp_evalue>5.15408e-06</Hsp_evalue>
<Hsp_query-from>580</Hsp_query-from>
<Hsp_query-to>792</Hsp_query-to>
<Hsp_hit-from>202</Hsp_hit-from>
<Hsp_hit-to>273</Hsp_hit-to>
<Hsp_query-frame>-1</Hsp_query-frame>
<Hsp_identity>26</Hsp_identity>
<Hsp_positive>45</Hsp_positive>
<Hsp_gaps>2</Hsp_gaps>
<Hsp_align-len>73</Hsp_align-len>
<Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
<Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
<Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
<Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
<Hit_accession>2466</Hit_accession>
<Hit_len>3084</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>48.9062</Hsp_bit-score>
<Hsp_score>115</Hsp_score>
<Hsp_evalue>6.70494e-06</Hsp_evalue>
<Hsp_query-from>369</Hsp_query-from>
<Hsp_query-to>875</Hsp_query-to>
<Hsp_hit-from>2312</Hsp_hit-from>
<Hsp_hit-to>2465</Hsp_hit-to>
<Hsp_query-frame>-2</Hsp_query-frame>
<Hsp_identity>52</Hsp_identity>
<Hsp_positive>70</Hsp_positive>
<Hsp_gaps>4</Hsp_gaps>
<Hsp_align-len>173</Hsp_align-len>
<Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
<Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
<Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
basically each query search creates an Iteration element. each iteration can have multiple hit which in turn can have multiple HSPs. I would like to get only the first hit and it's first HSP from each iteration. if the BLAST found no hits I would like to ignore the iteration.
I worked up this simple code:
#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
for hit in iteration.findall('Iteration_hits/Hit'):
It_Num = iteration.findtext('Iteration_iter-num')
It_ID = iteration.findtext('Iteration_query-def')
Hit_Def = hit.findtext('Hit_def')
Num = hit.findtext('Hit_num')
ID = hit.findtext('Hit_id')
DEF = hit.findtext('Hit_def')
ACC = hit.findtext('Hit_accession')
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
for hsp in hit.findall('Hit_hsps'):
HSPN = hsp.findtext('Hsp/Hsp_num')
identities = hsp.findtext('Hsp/Hsp_identity')
#print 'id: ', identities.rjust(4),
length = hsp.findtext('Hsp/Hsp_align-len')
#print 'len:', length.rjust(4),
Per_ID = int(identities) * 100.0 / int(length)
#print hsp.findtext('Hsp/Hsp_qseq')[:50]
#print hsp.findtext('Hsp/Hsp_midline')[:50]
#print hsp.findtext('Hsp/Hsp_hseq')[:50]
save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
save_file.write('n\n' % ())
any help would be greatly appriciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
虽然构建您自己的解析器可能很“有趣”,但已经有一个可以解析 BLAST xml 文件的包...如果您愿意,它甚至可以为您执行本地 BLAST 实例的中间调用。
主要站点在这里:
http://biopython.org/wiki/Biopython
和 XML BLAST 解析器在这里:
http://biopython.org/DIST/docs/tutorial/Tutorial.html# htoc82
类似于:
应该可以。我通常喜欢 BioPython 解析器和编写器,但我不喜欢其类结构组织。所以我通常只使用解析器并将我需要的信息提取到我自己的结构中。
YMMV
希望有帮助。
While building your own parser may be "fun" there is already a package out there that can parse BLAST xml files ... it can even do the intermediate calling of a local BLAST instance for you if you so desire.
The main site is here:
http://biopython.org/wiki/Biopython
and the XML BLAST parser is here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82
Something like:
Should work. I generally love the BioPython parsers and writers but I do not like the class-structure organization. So I generally just use the parser and extract the info I need into my own structure.
YMMV
Hope that helps.
我会赞同 JudoWill 的建议——使用 BioPython 解析器更聪明地工作,而不是更辛苦。
这应该会让您更进一步:
这将在第一个查询之后停止(返回第一个也是最佳匹配)。由于我怀疑您可能需要同一文件中其他查询的结果,因此只需从最后一行删除
break
即可。I'll second what JudoWill suggested - work smarter, not harder with the BioPython parser.
This should get you a little further:
This will stop after the first Query (returning the first and best match). Since I suspect you may want results for other queries within the same file, just remove the
break
from the last line.如果我了解您的基本要求,那么您希望获得查询蛋白质/核苷酸序列的最高命中/HSP。为什么不在您的系统上安装带有格式化 nr/nt 数据库的独立blast。
键入选项,
在 MS Excel 中打开 xml 输出文件,您可以在其中看到blast输出的表格形式,其中每个查询序列的顶部单次命中
If I understand your basic requirement then you want to get the top hit/HSP of the query protein/nucleotide sequence.Why dont you install the standalone blast on your system with formatted nr/nt database.
Type the options
open the xml output file in MS excel where you can see a tabulated form of the blastoutput with top single hit to each query sequence