One way to do this:
Create a vector where each element holds a protein name and its sequence.
Go through the file line by line.
If the line starts with >, add an element to the end of the vector and save line.substring(1) to it as the protein name. Initialize the element's sequence to "".
If line.length == 0, the line is blank; do nothing.
Otherwise the line does not start with >, so it is part of the sequence: append it with currentElement.sequence += line. This way every line between >protein2 and >protein3 is concatenated and saved as the sequence of protein2.
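The steps above could be sketched in C# roughly as follows. This is only a minimal sketch: `FastaRecord` and `FastaReader` are hypothetical names chosen for illustration, and it assumes that any sequence text appearing before the first `>` header can be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical record type: one element of the "vector" per protein.
public class FastaRecord
{
    public string Name = "";
    public StringBuilder Sequence = new StringBuilder();
}

public static class FastaReader
{
    // Reads records from any TextReader (a StreamReader for a file,
    // a StringReader for an in-memory sample).
    public static List<FastaRecord> Read(TextReader reader)
    {
        var records = new List<FastaRecord>();
        string line;
        // ReadLine returns null only at the end of the input.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                continue; // blank line: do nothing

            if (line[0] == '>')
            {
                // Header line: start a new record, dropping the leading '>'.
                records.Add(new FastaRecord { Name = line.Substring(1) });
            }
            else if (records.Count > 0)
            {
                // Sequence line: append to the most recent record.
                records[records.Count - 1].Sequence.Append(line);
            }
            // Assumption: sequence text before the first header is ignored.
        }
        return records;
    }
}
```

A `StringBuilder` stands in for the plain `+=` on a string, since protein and DNA sequences can run to many lines. To read a file, pass something like `new StreamReader(@"C:\myfile.fasta")` to `FastaReader.Read` inside a `using` block.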
I think a little more detail about the exact file structure would be helpful. Just looking at what you have (and a quick peek at the samples on Wikipedia) suggests that the name of the protein is preceded by a > and followed by at least one line break, so that would be a good place to start.
You could split the file on newlines and look for a > character to find each name.
From there it is a little less clear, because I'm not sure whether the sequence data is all on one line (no line breaks) or whether it can contain line breaks. If there are none, you should be able to just store that sequence information and move on to the next protein name. Something like this:
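using System.IO;

// Verbatim string (@) so the backslash in the path is not read as an escape sequence.
using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at end of file; a blank line is skipped rather than treated as the end.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue;
        // StoreProteinName and StoreSequence are placeholders for your own storage code.
        if (line.StartsWith(">"))
            StoreProteinName(line);
        else
            StoreSequence(line);
    }
}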
If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
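As a rough illustration of that sample-driven approach, here is a small split-on-newline parser checked against a few hand-written samples. Everything in it is hypothetical: the `FastaSamples` class, the `Parse` and `Expect` helpers, and the sample strings are made up for the sketch, and it assumes protein names are unique (a repeated name would overwrite the earlier entry).

```csharp
using System;
using System.Collections.Generic;

public static class FastaSamples
{
    // Hypothetical parser under test: splits the text on newlines
    // and maps each protein name to its concatenated sequence.
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        string current = null;
        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim(); // also drops the '\r' of Windows line endings
            if (line.Length == 0)
                continue;
            if (line[0] == '>')
            {
                current = line.Substring(1);
                result[current] = "";
            }
            else if (current != null)
            {
                result[current] += line;
            }
        }
        return result;
    }

    // Tiny stand-in for a test framework assertion.
    static void Expect(bool condition, string what)
    {
        if (!condition)
            throw new Exception("Sample failed: " + what);
    }

    public static void RunSamples()
    {
        // Sample 1: the sequence sits on a single line.
        Expect(Parse(">p1\nMKV\n")["p1"] == "MKV", "single-line sequence");

        // Sample 2: the sequence is wrapped over several lines.
        Expect(Parse(">p1\nMKV\nLAA\n")["p1"] == "MKVLAA", "wrapped sequence");

        // Sample 3: a blank line between records, with Windows line endings.
        var two = Parse(">p1\r\nMK\r\n\r\n>p2\r\nGG\r\n");
        Expect(two.Count == 2 && two["p1"] == "MK" && two["p2"] == "GG",
               "blank lines and CRLF");
    }
}
```

Each new quirk found in a real file becomes one more sample in `RunSamples`; in practice the same checks would probably live in a proper unit test framework instead of a hand-rolled `Expect`.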
Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequence data in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.).
If you're interested in C or C++, check out the answers to this question over at Biostar:
http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics
Do yourself a favor, and don't reinvent the wheel if you don't have to.