我有一个包含数百万个序列的 fasta 文件。我只想提取 .txt 文件中名称匹配的内容,我该如何执行此操作?
我一直在对大约 1.5m 的读取 fasta 文件(“V1_6D_contigs_5kbp.fa”)进行排序,以确定哪些读取可能是“病毒式”的。该文件中的读数表示为 Vx_Cz - 其中 x 是 1-6,具体取决于它来自哪个试验组,z 是来自 1-~1.5meg V1_C10810 或 V3_C587937 的重叠群编号/名称...
通过不同的生物信息学管道我生成了一个 .txt 文件,其中包含预测的重叠群名称列表(长 2699 个) (<0.05) 具有病毒性。我现在需要使用这个预测重叠群列表来提取并生成一个仅包含这些重叠群的新 fasta 文件。
我的代码背后的理论思想是,它打开 .txt 文件(每个重要重叠群的名称)和原始 fasta 文件,遍历 .txt 文件的每一行并将该行(重叠群名称)设置为变量。然后,它应该循环遍历包含所有序列信息的原始 fasta 文件,如果重叠群名称与 record.id(原始文件中的重叠群名称)匹配,则应该将完整记录信息导出到新文件。
我认为我已经很接近了,但我当前的迭代似乎只运行一个或另一个循环,正如我所期望的那样。
请参阅下面的代码。我在下面添加了我尝试过的每个程序的错误注释。
我正在使用 Python,包括 SeqIO Biopython 应用程序。
#!/usr/bin/env python
from Bio import SeqIO
Lines = []
Counter = 0
With open(‘VIBRANT_contigs.txt’) as f:
lines = f.readlines()
original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
with open(original_file) as original, open(corrected_file,'w') as corrected:
records = SeqIO.parse(original_file,'fasta')
for record in records:
id = record.id
print(“this is the id”, id)
for line in lines:
contig = line
print(“This is the contig”, contig)
if contig == id:
SeqIO.write(record, corrected, ‘fasta’)
Counter = (counter + 1)
这仅打印“这是 id”的列表,id
#!/usr/bin/env python
Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
lines = f.readlines()
for line in lines:
contig = line
print(contig)
这可以工作并打印 Vibrant_contigs.txt 中的重叠群列表
>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
... lines = f.readlines()
... original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
... corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
... with open(original_file) as original, open(corrected_file,'w') as corrected:
... records = SeqIO.parse(original_file,'fasta')
... for line in lines:
... contig = line
... for record in records:
... id = record.id
... if contig == id:
... print("we got a hit")
不打印任何
>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
... lines = f.readlines()
... original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
... corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
... with open(original_file) as original, open(corrected_file,'w') as corrected:
... records = SeqIO.parse(original_file,'fasta')
... for line in lines:
... contig = line
... print("this is the contig", contig)
... for record in records:
... id = record.id
... print("this is the id", id)
... if contig == id:
... print("we got a hit")
内容 打印“这是重叠群 V1_C1005406”(列表中的第一个重叠群)
然后继续打印所有ids
然后只需打印所有重叠群
任何帮助将不胜感激,因为我非常困难。
感谢您抽出时间。
I have been sorting through a ~1.5m read fasta file ('V1_6D_contigs_5kbp.fa') to determine which of the reads are likely to be 'viral' in origin. The reads in this file are denoted as Vx_Cz - where x is 1-6, depending on which trial group it came from, and z is the contig number/name from 1-~1.5m. e.g V1_C10810 or V3_C587937...
Through varying bioinformatic pipelines I have produced a .txt file with a list (2699 long) of the contig names that are predicted (<0.05) to be viral. I now need to use this list of predicted contigs to extract and produce a new fasta file that contains only these contigs.
The theoretical idea behind my code is that it opens the .txt file (names of each significant contig) and the original fasta file, goes through each line of the .txt file and sets the line (contig name) as a variable. It should then loop through the original fasta file which contains all the sequence information and if the contig name matches the record.id (contig name from original file) it should then export the full record information to a new file.
I think I am close, but my current iterations seems to run only one or the the other loop as I expect them to.
Please see the code below. I have added notes below to what runs wrong with each program I have tried.
I am using Python, including SeqIO the Biopython application.
#!/usr/bin/env python
from Bio import SeqIO
Lines = []
Counter = 0
With open(‘VIBRANT_contigs.txt’) as f:
lines = f.readlines()
original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
with open(original_file) as original, open(corrected_file,'w') as corrected:
records = SeqIO.parse(original_file,'fasta')
for record in records:
id = record.id
print(“this is the id”, id)
for line in lines:
contig = line
print(“This is the contig”, contig)
if contig == id:
SeqIO.write(record, corrected, ‘fasta’)
Counter = (counter + 1)
This only prints a list of “this is the id”, id
#!/usr/bin/env python
Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
lines = f.readlines()
for line in lines:
contig = line
print(contig)
This works and prints the list of contigs within Vibrant_contigs.txt
>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
... lines = f.readlines()
... original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
... corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
... with open(original_file) as original, open(corrected_file,'w') as corrected:
... records = SeqIO.parse(original_file,'fasta')
... for line in lines:
... contig = line
... for record in records:
... id = record.id
... if contig == id:
... print("we got a hit")
Prints nothing
>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
... lines = f.readlines()
... original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
... corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
... with open(original_file) as original, open(corrected_file,'w') as corrected:
... records = SeqIO.parse(original_file,'fasta')
... for line in lines:
... contig = line
... print("this is the contig", contig)
... for record in records:
... id = record.id
... print("this is the id", id)
... if contig == id:
... print("we got a hit")
Prints ‘this is the contig V1_C1005406’ (the first contig in the list)
Then proceeds to print all the ids
Then just print all the contigs
Any help would be greatly appreciated as I am very stuck.
Thank you for your time.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在相当多的拼写错误中,主要问题是
lines=f.readlines()
中的行仍将包含换行符\n
,因此永远不会匹配 id来自SeqIO
,解决方案是使用简单的strip()
调用:输入数据如下:
input_sequences.fasta:
ids_to_extract: (
SeqIo
在空格处分割,因此 id 应该只到第一个空格)也就是说,这段代码仍然非常低效。对于每个重叠群,您将迭代文件的所有行。更简单的方法是生成您需要的所有 id 的
set
,然后只需检查 id 是否在set
中,从而增加检查的平均运行时间 <代码>O(n) 到O(1)
(参考):另请注意,还有其他几种快速执行此操作的简单选项:
Among quite a few typos, the main issue is that the line from
lines=f.readlines()
will still contain the newline character\n
and will therefore never match the id fromSeqIO
, the solution is to use a simplestrip()
call:The input data is like this:
input_sequences.fasta:
ids_to_extract: (
SeqIo
splits at whitespaces, so the id should only be up to the first whitespaces)That said, this code is still very inefficient. For every contig, you are iterating over all lines of the file. Much easier would be to generate a
set
of all ids that you need and then just check if it the id is in theset
, increasing the average runtime for the check fromO(n)
toO(1)
(reference):Note also that there are several other fast an easy options to perform this operation: