Finding unique clones with Biopython's SeqIO module
I am working on Next Generation Sequencing (NGS) analysis of DNA. I am using the SeqIO module from Biopython to parse DNA libraries in FASTA format, and I want to keep only the unique clones (unique records). I am using the following Python code for this purpose:
from Bio import SeqIO

seen = []
unique_clones = []

records = list(SeqIO.parse('DNA_library', 'fasta'))
for record in records:
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        unique_clones.append(record)

SeqIO.write(unique_clones, 'unique_clones.fasta', 'fasta')
One DNA library has around 1 million records, and I have more than 100 libraries to analyze. Filtering a single library with this code takes more than 2 hours; it seems very slow at extracting unique clones. Is there another way to filter unique records?
For Python coders who don't have any experience working with bioinformatics: a clone in FASTA format has two parts, an ID (>id-number) and a record (ATCG), and looks like this:
>id-No-1
ATCGGGCTAAATTCGACTGCAGT
>id-No-2
ATCGGGCTAAATTCGACTGCAGT
I just want to filter the unique clones based on their record and print the unique clones (ID and record).
Please let me know if my question or explanation is not clear enough.
2 Answers
I don't have your files, so I can't test the actual performance gain you'll get, but here are some things that stick out as slow to me:

records = list(SeqIO.parse('DNA_library', 'fasta'))

converts the records into a list of records, which may sound inoffensive but becomes costly when you have millions of records. According to the docs, SeqIO.parse(...) returns an iterator, so you can simply iterate over it directly.

Also, use a set instead of a list when keeping track of seen records. When performing a membership check with in, a list must iterate through every element, while a set performs the operation in constant time.

With those changes, your code becomes:
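The answer's code block did not survive extraction. Below is a sketch of the rewrite it describes, combining the two suggestions (iterate over SeqIO.parse directly, track seen sequences in a set). Wrapping it in a function and the name filter_unique are my additions, not the answer's:

```python
from Bio import SeqIO

def filter_unique(in_path, out_path):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()          # set membership checks are O(1)
    unique_clones = []
    # SeqIO.parse returns an iterator, so records are read lazily
    # instead of being materialized into a million-element list.
    for record in SeqIO.parse(in_path, 'fasta'):
        seq = str(record.seq)
        if seq not in seen:
            seen.add(seq)
            unique_clones.append(record)
    # SeqIO.write returns the number of records written
    return SeqIO.write(unique_clones, out_path, 'fasta')
```

To match the question, this would be called as filter_unique('DNA_library', 'unique_clones.fasta').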
There is no need to repeat the set advice mentioned above: the CPython machinery checks presence with if seq not in seen_records: and then inserts with seen_records.add(seq), reading the input from sys.stdin. So I would try this:

If this is still too slow for you, leave a comment, because as I see it this is a job for Cython. When you work with big data in Python, it is recommended to handle it inside C/C++-backed packages rather than with Python loops and the like.
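The code that followed "So I would try this:" is missing from the page. A plain-Python sketch in the spirit the answer describes (raw line parsing, a seen_records set, input from sys.stdin) might look like the following; it assumes each sequence sits on a single line, as in the example records above, and the function name unique_fasta is mine:

```python
def unique_fasta(lines):
    """Yield (header, sequence) pairs, keeping only the first clone
    seen with each sequence. Assumes one-line sequences, as in the
    example FASTA records above."""
    seen_records = set()
    header = None
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            header = line      # remember the ID line for this clone
        elif line:
            if line not in seen_records:   # O(1) set lookup
                seen_records.add(line)
                yield header, line

# Works the same on a list of lines, an open file, or sys.stdin:
for header, seq in unique_fasta(['>id-No-1', 'ATCGGGCTAAATTCGACTGCAGT',
                                 '>id-No-2', 'ATCGGGCTAAATTCGACTGCAGT']):
    print(header)   # prints '>id-No-1' only; id-No-2 is a duplicate
    print(seq)
```

Because it never builds SeqRecord objects, this stays close to raw I/O speed, at the cost of Biopython's format handling (multi-line sequences, wrapping on output, etc.).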