Retrieving missing sequences using Python: the 'split' command does not work
I have a set of protein sequences that were found using a piece of software, but they are shorter than the original sequences in the database. I downloaded the entire database, so I now have both the set of incomplete sequences and the original database they came from.
Example result from software:
>tr|E7EWP2|E7EWP2_HUMAN Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE
Sequence in the database:
>tr|E7EWP2|E7EWP2_HUMAN Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP
So the missing residues are 'ARR' at the start and 'EIP' at the end. I have around 70 incomplete sequences like this. I would like to write a Python program that can automatically retrieve the complete sequences from the database.
I am really new to Python. Of course I will try to write my own code, but I would like to know if there are any libraries, such as Biopython modules, that can do this.
My plan is to take the intervals from my results, expand them, and select them in the original database, but I do not know how to proceed further.
I would like to get list_seq = [ARR,KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE,EIP]
so that I can then use list_seq[0] r.strip(3)
and list_seq[1] l.strip[3]
and get the complete sequence, but list_seq does not work.
Thanks in advance
Comments (1)
You can use the index method from BioPython's SeqIO. The index method indexes database records by protein id and does not load the full database into memory, which allows efficient searches against complete, large databases (alternatively, you could use a conventional database such as sqlite to store your records first and run the searches on it). This program just prints the records with differences. From this point you can save them in a list or write them to a FASTA file.
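A minimal sketch of this approach, not the answerer's original program. It assumes both files are in FASTA format and that record ids match between them; the function names split_around and report_differences and the file names in the usage note are hypothetical. SeqIO.index and SeqIO.parse are the Biopython calls used, and the splitting itself relies on Python's built-in str.partition.

```python
def split_around(full_seq, fragment):
    """Return (prefix, fragment, suffix), or None if fragment is not in full_seq."""
    prefix, found, suffix = full_seq.partition(fragment)
    if not found:
        return None
    return prefix, found, suffix


def report_differences(database_fasta, results_fasta):
    """Yield (record id, prefix, fragment, suffix) for every truncated hit."""
    # Imported here so split_around() stays usable without Biopython installed.
    from Bio import SeqIO

    # Index the database by record id without loading every sequence into memory.
    database = SeqIO.index(database_fasta, "fasta")
    try:
        for hit in SeqIO.parse(results_fasta, "fasta"):
            if hit.id not in database:
                continue  # no record with this id in the database
            full_seq = str(database[hit.id].seq)
            parts = split_around(full_seq, str(hit.seq))
            # Keep only hits that are really shorter than the database entry.
            if parts is not None and (parts[0] or parts[2]):
                yield (hit.id, *parts)
    finally:
        database.close()


# The example from the question, using only the pure-Python helper.
full = "ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP"
frag = "KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE"
list_seq = split_around(full, frag)
print(list_seq)  # ('ARR', 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE', 'EIP')
```

With real files, something like `for rec in report_differences("database.fasta", "results.fasta"): print(rec)` would print only the records that differ. Since the database entry already is the complete sequence, the prefix and suffix are only needed if you want to see which residues were missing.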