使用biopython的SeqIO模块发现独特的克隆

发布于 2025-01-12 02:57:10 字数 779 浏览 0 评论 0原文

我正在研究 DNA 的下一代测序 (NGS) 分析。我正在使用 SeqIO Biopython 模块来解析 Fasta 格式的 DNA 库。我只想过滤唯一的克隆（唯一的记录）。为此，我使用以下 python 代码。

seen=[]
unique_clones=[]
records=list(SeqIO.parse('DNA_library', 'fasta'))
for record in records:
  if str(record.seq) not in seen:
    seen.append(str(record.seq))
    unique_clones.append(record)
SeqIO.write(unique_clones, 'unique_clones.fasta', 'fasta')

一个 DNA 文库大约有 100 万条记录，而我有 100 多个文库需要分析。使用此代码过滤 1 个库需要 2 个多小时。这段代码提取唯一克隆的速度似乎非常慢。还有其他方法可以过滤唯一记录吗？

对于没有任何生物信息学工作经验的Python编码员

fasta格式的克隆有两个参数，ID（>id-number）和记录（ATCG），如下所示：

>id-No-1
ATCGGGCTAAATTCGACTGCAGT

>id-No-2
ATCGGGCTAAATTCGACTGCAGT

我只是想要根据记录过滤唯一克隆并想要打印唯一克隆（id 和记录）。

如果我的问题或解释不够清楚，请告诉我。

原文

I am working on Next Generation Sequencing (NGS) analysis of DNA. I am using SeqIO Biopython module to parse the DNA libraries in Fasta format. I want to filter the unique clones (unique records) only. I am using the following python code for this purpose.

seen=[]
unique_clones=[]
records=list(SeqIO.parse('DNA_library', 'fasta'))
for record in records:
  if str(record.seq) not in seen:
    seen.append(str(record.seq))
    unique_clones.append(record)
SeqIO.write(unique_clones, 'unique_clones.fasta', 'fasta')

one DNA library has around 1 million records, and I have more than 100 libraries to be analyzed. 1 Library is taking more than 2 Hours to be filtered by this code. This code seems very slow to extract unique clones. Is there any other way to filter unique records??

for python coders who don't have any experience in working with bioinformatics

clones in fasta format has two parameters, ID (>id-number) and record(ATCG) and looks like below:

>id-No-1
ATCGGGCTAAATTCGACTGCAGT

>id-No-2
ATCGGGCTAAATTCGACTGCAGT

I just want to filter unique clones based on their record and want to print unique clones (id and record).

Please let me know if I am not enough clear in my question or explanation.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

宫墨修音 2025-01-19 02:57:10

我没有你的文件，所以我无法测试你将获得的实际性能增益，但这里有一些对我来说很慢的事情：

行 records=list(SeqIO.parse('DNA_library', 'fasta')) 将记录转换为记录列表，这可能听起来无伤大雅，但如果您有数百万条记录，则会变得昂贵。根据 docs，SeqIO.parse(...) 返回一个迭代器，因此您可以直接迭代它。
跟踪查看的记录时，使用set而不是list。使用 in 执行成员资格检查时，列表必须迭代每个元素，而集合则在恒定时间内执行操作（更多信息此处）。

通过这些更改，您的代码将变为：

seen_records = set()
records_to_keep = []

for record in SeqIO.parse('DNA_library', 'fasta'):
  seq = str(record.seq)
  if seq not in seen_records:
    seen_records.add(seq)
    records_to_keep.append(record)

SeqIO.write(records_to_keep, 'unique_clones.fasta', 'fasta')

I don't have your files so I cannot test the actual performance gain you'll get, but here are some things that stick out as slow to me:

the line records=list(SeqIO.parse('DNA_library', 'fasta')) converts the records into a list of records, which may sound inoffensive but becomes costly if you have millions of records. According to the docs, SeqIO.parse(...) returns an iterator so you can simply iterate over it directly.
Use a set instead of a list when keeping track of seen records. When performing membership checking using in, lists must iterate through every element while sets perform the operation in constant time (more info here).

With those changes, your code becomes:

seen_records = set()
records_to_keep = []

for record in SeqIO.parse('DNA_library', 'fasta'):
  seq = str(record.seq)
  if seq not in seen_records:
    seen_records.add(seq)
    records_to_keep.append(record)

SeqIO.write(records_to_keep, 'unique_clones.fasta', 'fasta')

回复收藏 0 原文

梅窗月明清似水 2025-01-19 02:57:10

不需要像上面提到的那样使用 set，因为该代码中的 cpython 机器会检查这里是否存在 if seq not in saw_records: 以及这里的 seen_records。添加（seq）（sys.stdin）。所以我会尝试这个：

from Bio.SeqRecord import SeqRecord

seen_seqs = list()
IDs = list()

for record in SeqIO.parse('DNA_library', 'fasta'):
    if record.seq not in seen_seqs:
        seen_seqs.append(record.seq)
        IDs.append(record.id)

SeqIO.write((SeqRecord(seq, id=ID, name="", description="") for seq, ID in zip(seen_seqs, IDs)),
            'unique_clones.fasta',
            'fasta')

如果这对你来说仍然太慢，请发表评论，因为据我所知，这对 Cython 来说是有用的。当您在 python 中处理大数据时，建议在其 C/C++ 包和包中处理它们。不要使用 python 的循环和类似的东西。

There is no need to use set as it was mentioned above because cpython machine in that code check the presence here if seq not in seen_records: and then here seen_records.add(seq) (sys.stdin). So I would try this:

from Bio.SeqRecord import SeqRecord

seen_seqs = list()
IDs = list()

for record in SeqIO.parse('DNA_library', 'fasta'):
    if record.seq not in seen_seqs:
        seen_seqs.append(record.seq)
        IDs.append(record.id)

SeqIO.write((SeqRecord(seq, id=ID, name="", description="") for seq, ID in zip(seen_seqs, IDs)),
            'unique_clones.fasta',
            'fasta')

If this is still too slow for you, comment about because this is work for Cython as I see it. When you work with big data in python it is recommended to handle them inside its C/C++ packages & not to use python's loops and things like that.

回复收藏 0 原文

~没有更多了~