Best way to read a FASTA file in C#

Posted on 2024-09-06 19:08:24

Comments (3)

妞丶爷亲个 2024-09-13 19:08:24

One way to do this:

  1. Create a vector in which each element
    holds a name and a sequence.
  2. Go through the file line by line:

    • If the line starts with >, append a new
      element to the end of the vector and save
      line.substring(1) in it as the protein name.
      Initialize the element's sequence to "".
    • If line.length == 0, the line is blank,
      so do nothing.
    • Otherwise the line is part of a sequence,
      so append it to the current (last) element:
      element.sequence += line. This way every line
      between >protein2 and >protein3 is concatenated
      and saved as the sequence of protein2.
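
The steps above might be sketched in C# roughly as follows. This is only an illustration under some assumptions: the names FastaRecord, FastaParser and Parse are made up for this example, a List<T> stands in for the "vector", and a StringBuilder replaces the repeated += on the sequence. Sequence lines that appear before the first > header are silently ignored here, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical record type: one name plus its concatenated sequence.
public sealed class FastaRecord
{
    public string Name { get; }
    public string Sequence { get; }

    public FastaRecord(string name, string sequence)
    {
        Name = name;
        Sequence = sequence;
    }
}

// Hypothetical parser class, not part of any library.
public static class FastaParser
{
    public static List<FastaRecord> Parse(TextReader reader)
    {
        var records = new List<FastaRecord>();
        var sequence = new StringBuilder();
        string name = "";
        bool inRecord = false; // becomes true once the first '>' header is seen
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                continue; // blank line: do nothing

            if (line[0] == '>')
            {
                // A new header closes the previous record, if there was one.
                if (inRecord)
                    records.Add(new FastaRecord(name, sequence.ToString()));

                name = line.Substring(1).Trim();
                sequence.Clear();
                inRecord = true;
            }
            else if (inRecord)
            {
                // Every non-header line belongs to the current record.
                sequence.Append(line.Trim());
            }
        }

        // The last record has no following header, so it is added here.
        if (inRecord)
            records.Add(new FastaRecord(name, sequence.ToString()));

        return records;
    }
}
```

To read from disk, pass a StreamReader wrapped in a using block to FastaParser.Parse; taking a TextReader also lets you feed it a StringReader when trying out sample data.
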
尬尬 2024-09-13 19:08:24

I think a little more detail about the exact file structure would help. Judging from what you have (and a quick look at the samples on Wikipedia), the name of each protein is prefixed with a > and followed by at least one line break, so that would be a good place to start.

You could split the file on newlines and look for a leading > character to identify the names.
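
As a rough illustration of that idea, the sketch below splits a FASTA text that is already in memory on line breaks and collects the header lines. The class and method names (FastaNames, ReadNames) are invented for this example, and it assumes the whole file fits comfortably in a string; a lone "\r" as line terminator is not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class FastaNames
{
    // Returns every header line, without the leading '>'.
    public static List<string> ReadNames(string fastaText)
    {
        var names = new List<string>();

        // Split on Windows and Unix line endings and drop empty lines.
        string[] lines = fastaText.Split(
            new[] { "\r\n", "\n" },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            if (line.StartsWith(">", StringComparison.Ordinal))
                names.Add(line.Substring(1).Trim());
        }

        return names;
    }
}
```

The text could come from File.ReadAllText, for example; for large files, reading line by line is likely the better choice.
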

From there it is less clear, because I'm not sure whether the sequence data is all on one line or can contain line breaks. If it is always on one line, you should be able to store that sequence and move on to the next protein name. Something like this:

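// Requires: using System.IO;
// StoreProteinName and StoreSequence are placeholders for your own storage logic.
// The verbatim string (@"...") is needed because "\m" is not a valid escape sequence.
using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at the end of the file.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue; // skip blank lines rather than stopping at the first one
        if (line.StartsWith(">"))
            StoreProteinName(line);
        else
            StoreSequence(line);
    }
}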

If it were me, I would probably use TDD and some sample data to build a simple parser, then keep plugging in samples until I felt I had covered all the major variations in the format.

余生再见 2024-09-13 19:08:24

Can you use a language other than C#? Off the top of my head, there are excellent libraries for dealing with FASTA files and other biological sequences in Perl, Python, Ruby, Java, and R. They're usually branded Bio* (e.g. BioPerl, BioJava).

If you're interested in C or C++, check out the answers to this question over at Biostar:
http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics

Do yourself a favor, and don't reinvent the wheel if you don't have to.
