我有一个包含数百万个序列的 fasta 文件。我只想提取 .txt 文件中名称匹配的内容，我该如何执行此操作？

发布于 2025-01-12 06:44:07 字数 2984 浏览 0 评论 0原文

我一直在对大约 1.5m 的读取 fasta 文件（“V1_6D_contigs_5kbp.fa”）进行排序，以确定哪些读取可能是“病毒式”的。该文件中的读数表示为 Vx_Cz - 其中 x 是 1-6，具体取决于它来自哪个试验组，z 是来自 1-~1.5meg V1_C10810 或 V3_C587937 的重叠群编号/名称...

通过不同的生物信息学管道我生成了一个 .txt 文件，其中包含预测的重叠群名称列表（长 2699 个） (<0.05) 具有病毒性。我现在需要使用这个预测重叠群列表来提取并生成一个仅包含这些重叠群的新 fasta 文件。

我的代码背后的理论思想是，它打开 .txt 文件（每个重要重叠群的名称）和原始 fasta 文件，遍历 .txt 文件的每一行并将该行（重叠群名称）设置为变量。然后，它应该循环遍历包含所有序列信息的原始 fasta 文件，如果重叠群名称与 record.id（原始文件中的重叠群名称）匹配，则应该将完整记录信息导出到新文件。

我认为我已经很接近了，但我当前的迭代似乎只运行一个或另一个循环，正如我所期望的那样。

请参阅下面的代码。我在下面添加了我尝试过的每个程序的错误注释。

我正在使用 Python，包括 SeqIO Biopython 应用程序。

#!/usr/bin/env python

from Bio import SeqIO

Lines = []
Counter = 0 

With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
 original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
 corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
  with open(original_file) as original, open(corrected_file,'w') as corrected:
   records = SeqIO.parse(original_file,'fasta')
    for record in records: 
     id = record.id
     print(“this is the id”, id)
     for line in lines:
      contig = line
      print(“This is the contig”, contig)
       if contig == id:
        SeqIO.write(record, corrected, ‘fasta’)  
         Counter = (counter + 1)

这仅打印“这是 id”的列表，id

#!/usr/bin/env python

Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
     for line in lines:
      contig = line
      print(contig)

这可以工作并打印 Vibrant_contigs.txt 中的重叠群列表

>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    for record in records:
...     id = record.id
...     if contig == id:
...      print("we got a hit")

不打印任何


>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    print("this is the contig", contig)
...    for record in records:
...     id = record.id
...     print("this is the id", id)
...     if contig == id:
...      print("we got a hit")

内容打印“这是重叠群 V1_C1005406”（列表中的第一个重叠群）

然后继续打印所有ids

然后只需打印所有重叠群

任何帮助将不胜感激，因为我非常困难。

感谢您抽出时间。

原文

I have been sorting through a ~1.5m read fasta file ('V1_6D_contigs_5kbp.fa') to determine which of the reads are likely to be 'viral' in origin. The reads in this file are denoted as Vx_Cz - where x is 1-6, depending on which trial group it came from, and z is the contig number/name from 1-~1.5m. e.g V1_C10810 or V3_C587937...

Through varying bioinformatic pipelines I have produced a .txt file with a list (2699 long) of the contig names that are predicted (<0.05) to be viral. I now need to use this list of predicted contigs to extract and produce a new fasta file that contains only these contigs.

The theoretical idea behind my code is that it opens the .txt file (names of each significant contig) and the original fasta file, goes through each line of the .txt file and sets the line (contig name) as a variable. It should then loop through the original fasta file which contains all the sequence information and if the contig name matches the record.id (contig name from original file) it should then export the full record information to a new file.

I think I am close, but my current iterations seems to run only one or the the other loop as I expect them to.

Please see the code below. I have added notes below to what runs wrong with each program I have tried.

I am using Python, including SeqIO the Biopython application.

#!/usr/bin/env python

from Bio import SeqIO

Lines = []
Counter = 0 

With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
 original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
 corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
  with open(original_file) as original, open(corrected_file,'w') as corrected:
   records = SeqIO.parse(original_file,'fasta')
    for record in records: 
     id = record.id
     print(“this is the id”, id)
     for line in lines:
      contig = line
      print(“This is the contig”, contig)
       if contig == id:
        SeqIO.write(record, corrected, ‘fasta’)  
         Counter = (counter + 1)

This only prints a list of “this is the id”, id

#!/usr/bin/env python

Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
     for line in lines:
      contig = line
      print(contig)

This works and prints the list of contigs within Vibrant_contigs.txt

>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    for record in records:
...     id = record.id
...     if contig == id:
...      print("we got a hit")

Prints nothing


>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    print("this is the contig", contig)
...    for record in records:
...     id = record.id
...     print("this is the id", id)
...     if contig == id:
...      print("we got a hit")

Prints ‘this is the contig V1_C1005406’ (the first contig in the list)

Then proceeds to print all the ids

Then just print all the contigs

Any help would be greatly appreciated as I am very stuck.

Thank you for your time.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

嘿咻 2025-01-19 06:44:07

在相当多的拼写错误中，主要问题是 lines=f.readlines() 中的行仍将包含换行符 \n，因此永远不会匹配 id来自SeqIO，解决方案是使用简单的strip()调用：

from Bio import SeqIO

Lines = []
Counter = 0

with open("ids_to_extract.txt") as f:
    lines = f.readlines()
    original_file = r"input_sequences.fasta"
    corrected_file = r"extracted_sequences.fasta"
    with open(original_file) as original, open(corrected_file,'w') as corrected:
        records = SeqIO.parse(original_file,'fasta')
        for record in records:
            id = record.id
            print("this is the id", id)
            for line in lines:
                contig = line.strip() # <--- fixed here
                print("This is the contig", contig)
                if contig == id:
                    SeqIO.write(record, corrected, 'fasta')
                    Counter = (Counter + 1)

输入数据如下：

input_sequences.fasta：

>SRR10971381.1 1 length=151
CAAAGTCAAGCTCCCTTCTGCCTTTACACTCTTCGAGCGATTTCCGTCCGCTCTGAGGGAACCTTTGGGCGCCTCCGTTACTCTTTTGGAGGCGACCGCCCCAGTCAAACTGCCCGCCTAAGACTGTCCGGCCGGTCNTTACTCGGCNCGT
>SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>SRR10971381.3 3 length=47
TACGATGTCGTTCCGGAAGGTAAGGGCGTGGATGAGACTAAGATCGG
>SRR10971381.4 4 length=149
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
[....]

ids_to_extract：（SeqIo 在空格处分割，因此 id 应该只到第一个空格）

SRR10971381.1
SRR10971381.2
SRR10971381.3
SRR10971381.4

也就是说，这段代码仍然非常低效。对于每个重叠群，您将迭代文件的所有行。更简单的方法是生成您需要的所有 id 的 set，然后只需检查 id 是否在 set 中，从而增加检查的平均运行时间 <代码>O(n) 到 O(1) （参考):

from Bio import SeqIO

Lines = []
Counter = 0

#Pre-Read all ids to extract into a set
with open("ids_to_extract.txt") as f:
    ids_to_extract = {line.strip() for line in f.readlines()}

original_file = r"input_sequences.fasta"
corrected_file = r"extracted_sequences.fasta"
with open(original_file) as original, open(corrected_file,'w') as corrected:
    records = SeqIO.parse(original_file,'fasta')
    for record in records:
        recordId = record.id
        # Now a simplle `in` check
        if recordId in ids_to_extract:
            SeqIO.write(record, corrected, 'fasta')
            Counter = (Counter + 1)

另请注意，还有其他几种快速执行此操作的简单选项：

# using Heng Li's seqkit
seqkit grep -f ids_to_extract.txt input_sequences.fasta
# using bioawk
bioawk -c fastx -v file="ids_to_extract.txt" 'BEGIN{while((getline k < file)>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}'  input_sequences.fasta > extracted_sequences.fasta

Among quite a few typos, the main issue is that the line from lines=f.readlines() will still contain the newline character \n and will therefore never match the id from SeqIO, the solution is to use a simple strip() call:

from Bio import SeqIO

Lines = []
Counter = 0

with open("ids_to_extract.txt") as f:
    lines = f.readlines()
    original_file = r"input_sequences.fasta"
    corrected_file = r"extracted_sequences.fasta"
    with open(original_file) as original, open(corrected_file,'w') as corrected:
        records = SeqIO.parse(original_file,'fasta')
        for record in records:
            id = record.id
            print("this is the id", id)
            for line in lines:
                contig = line.strip() # <--- fixed here
                print("This is the contig", contig)
                if contig == id:
                    SeqIO.write(record, corrected, 'fasta')
                    Counter = (Counter + 1)

The input data is like this:

input_sequences.fasta:

>SRR10971381.1 1 length=151
CAAAGTCAAGCTCCCTTCTGCCTTTACACTCTTCGAGCGATTTCCGTCCGCTCTGAGGGAACCTTTGGGCGCCTCCGTTACTCTTTTGGAGGCGACCGCCCCAGTCAAACTGCCCGCCTAAGACTGTCCGGCCGGTCNTTACTCGGCNCGT
>SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>SRR10971381.3 3 length=47
TACGATGTCGTTCCGGAAGGTAAGGGCGTGGATGAGACTAAGATCGG
>SRR10971381.4 4 length=149
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
[....]

ids_to_extract: (SeqIo splits at whitespaces, so the id should only be up to the first whitespaces)

SRR10971381.1
SRR10971381.2
SRR10971381.3
SRR10971381.4

That said, this code is still very inefficient. For every contig, you are iterating over all lines of the file. Much easier would be to generate a set of all ids that you need and then just check if it the id is in the set, increasing the average runtime for the check from O(n) to O(1) (reference):

from Bio import SeqIO

Lines = []
Counter = 0

#Pre-Read all ids to extract into a set
with open("ids_to_extract.txt") as f:
    ids_to_extract = {line.strip() for line in f.readlines()}

original_file = r"input_sequences.fasta"
corrected_file = r"extracted_sequences.fasta"
with open(original_file) as original, open(corrected_file,'w') as corrected:
    records = SeqIO.parse(original_file,'fasta')
    for record in records:
        recordId = record.id
        # Now a simplle `in` check
        if recordId in ids_to_extract:
            SeqIO.write(record, corrected, 'fasta')
            Counter = (Counter + 1)

Note also that there are several other fast an easy options to perform this operation:

# using Heng Li's seqkit
seqkit grep -f ids_to_extract.txt input_sequences.fasta
# using bioawk
bioawk -c fastx -v file="ids_to_extract.txt" 'BEGIN{while((getline k < file)>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}'  input_sequences.fasta > extracted_sequences.fasta

回复收藏 0 原文

心凉 2025-01-19 06:44:07

seqtk subseq <original.fa> <contigs.txt> > out.fa

seqtk subseq <original.fa> <contigs.txt> > out.fa

回复收藏 0 原文

~没有更多了~

关于作者

鸠书

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

我有一个包含数百万个序列的 fasta 文件。我只想提取 .txt 文件中名称匹配的内容，我该如何执行此操作？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

燃烧我的卡路李先生

qq_2gSKZM

∞梦里开花

qq_IklFPL

迷途知返

深海不蓝

友情链接

我有一个包含数百万个序列的 fasta 文件。我只想提取 .txt 文件中名称匹配的内容，我该如何执行此操作？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

燃烧我的卡路李先生

qq_2gSKZM

∞梦里开花

qq_IklFPL

迷途知返

深海不蓝

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。