Retrieving missing sequences using Python: the 'split' command does not work

Posted on 2024-12-24 02:49:32

I have a set of (protein) sequences that were found using a piece of software, but they are shorter than the original sequences in the database. I downloaded the entire database, so I now have both the set of incomplete sequences and the original database they came from.

Example result from software:

>tr|E7EWP2|E7EWP2_HUMAN  Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE

Sequence in the database:

>tr|E7EWP2|E7EWP2_HUMAN  Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP

So the missing residues are 'ARR' at the start and 'EIP' at the end. I have around 70 incomplete sequences like this. I would like to write a Python program that can automatically retrieve the complete sequences from the database.
I am really new to Python. Of course I will try to write my own code, but I would like to know if there are any libraries, such as Biopython modules, that can do this.
My plan is to take the intervals from my results, expand them, and select them in the original database, but I do not know how to proceed further.

I would like to get list_seq = ['ARR', 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE', 'EIP'] so that I can then work with the flanking pieces (I tried things like rstrip(3) on list_seq[0] and lstrip(3) on list_seq[1]) and end up with the complete sequence, but building list_seq does not work.

Thanks in advance

じ违心 2024-12-31 02:49:32

You can use the index function from Biopython's SeqIO module. It indexes the database records by ID (for FASTA files, the first word of the header line, e.g. tr|E7EWP2|E7EWP2_HUMAN) and does not load the full database into memory, which allows efficient lookups even in large, complete databases. (Alternatively, you could first store your records in a conventional database such as SQLite and run the searches there.) For example:

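from Bio import SeqIO

db1 = "dbase.fasta"          # full database
db2 = "my_collection.fasta"  # incomplete sequences reported by the software

# SeqIO.index builds a dict-like object keyed by record ID and
# reads each record from disk only when it is accessed.
dbase_dict = SeqIO.index(db1, "fasta")
my_record_dict = SeqIO.index(db2, "fasta")

# Iterating over an index yields the record IDs (the keys).
for record_id in my_record_dict:
    if record_id in dbase_dict:
        rec_dbase = dbase_dict[record_id]
        rec_mine = my_record_dict[record_id]
        # Print the full database record when the two sequences differ.
        if str(rec_dbase.seq) != str(rec_mine.seq):
            print(rec_dbase)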

This program just prints the database records whose sequences differ from yours. From here you can collect them in a list or write them to a FASTA file.
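
For the list_seq asked about in the question, one possible approach is to split the full database sequence around the fragment with Python's built-in str.partition. The following is only a minimal sketch: it assumes the fragment occurs in the full sequence (the first occurrence is used if it appears more than once), and the helper name split_around is hypothetical. Note also that str.strip, str.lstrip and str.rstrip take a set of characters to remove rather than a count, so they cannot be used to cut off "three residues"; slicing (for example seq[:3] or seq[-3:]) does that instead.

```python
def split_around(full_seq, fragment):
    """Return [prefix, fragment, suffix], or None if the fragment is not found.

    Hypothetical helper for illustration; only the first occurrence
    of the fragment in full_seq is considered.
    """
    if not fragment:
        # str.partition raises ValueError on an empty separator.
        return None
    prefix, match, suffix = full_seq.partition(fragment)
    if not match:
        # The fragment does not occur in the full sequence.
        return None
    return [prefix, match, suffix]


# Sequences taken from the example in the question.
full_seq = "ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP"
fragment = "KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE"

list_seq = split_around(full_seq, fragment)
print(list_seq)

# Joining the three pieces gives back the complete sequence.
print("".join(list_seq) == full_seq)
```

With the records from the loop above, the two arguments would be str(rec_dbase.seq) and str(rec_mine.seq).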
