Validating DNA in C/C++
I am iterating over DNA sequences, pulling out chunks of 5-15 bases at a time into C++ std::string objects. Occasionally, a string will contain a non-ATCG base, and I want to take an action when this happens. For example, I might see:
CTACGGTACGRCTA
Because there is an 'R', I want to recognize this case. I am familiar with regex, but people seem to recommend several different libraries. I've seen Boost, TR1, and others. Can someone please suggest either a different way to catch my cases or tell me which library I should use and why?
Thanks
Comments (5)
A regular expression is overkill for this. You can use std::string::find_first_not_of().

Using C strspn() comes to mind.

You can of course use regular expressions. But why not keep it simple?
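
The two non-regex suggestions above can be sketched roughly as follows. This is a minimal illustration, not code from the original answers: the helper names first_invalid_base and is_valid_dna are hypothetical, and it assumes the bases are stored as uppercase A, C, G, T only.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: index of the first character that is not
// A, C, G, or T, or std::string::npos if every character is valid.
// Assumes uppercase input.
std::size_t first_invalid_base(const std::string& chunk) {
    return chunk.find_first_not_of("ACGT");
}

// Hypothetical helper: the same check using C strspn(), which returns
// the length of the leading run made up only of the accepted characters.
// The chunk is valid when that run covers the whole string.
bool is_valid_dna(const std::string& chunk) {
    return std::strspn(chunk.c_str(), "ACGT") == chunk.size();
}
```

For the chunk from the question, first_invalid_base("CTACGGTACGRCTA") gives the position of the 'R', so the caller can both detect the problem and see where it occurred. If lowercase bases are possible, the accepted set would need to be "ACGTacgt" instead.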
If you would like to use a regex to solve this problem, here is one that checks for one invalid char:
Or here is a regex to validate an entire sequence:
Pretty simple.
Edit: Removed irrelevant material.
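
The patterns this answer refers to are not shown in the text above. As a sketch only, assuming they amount to "[^ACGT]" for a single invalid character and "[ACGT]+" for a whole sequence, the checks might look like this with C++11 std::regex (the standardized successor to the TR1 and Boost regex libraries mentioned in the question). The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumed pattern: any single character outside A, C, G, T.
// regex_search succeeds if the pattern occurs anywhere in the chunk.
bool has_invalid_base(const std::string& chunk) {
    static const std::regex invalid("[^ACGT]");
    return std::regex_search(chunk, invalid);
}

// Assumed pattern: one or more valid bases.
// regex_match requires the pattern to cover the entire chunk.
bool is_entire_sequence_valid(const std::string& chunk) {
    static const std::regex valid("[ACGT]+");
    return std::regex_match(chunk, valid);
}
```

Note that with "[ACGT]+" an empty chunk is reported as invalid; use "[ACGT]*" if empty chunks should pass.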
Does R represent a potential DNA base ('letter')?
If so, the ordering of the bases is critical for correctly displaying or accurately interpreting the sequence as a whole.
Within a codon, determine which position the R occupies: RAA, ARA, or AAR. Knowing this is very important. Then handle these cases by defining their attributes.
If it's just junk, or perhaps leftover data from storage, loop through and remove it.
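
Both options in this comment can be sketched as below. This is an illustration under stated assumptions: the reading frame is assumed to start at index 0 of the string, only uppercase A, C, G, T count as valid, and the helper names codon_position_of_invalid and strip_invalid are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: position (0, 1, or 2) within its codon of the
// first non-ACGT character, assuming the reading frame starts at
// index 0. Returns -1 if the sequence has no such character.
int codon_position_of_invalid(const std::string& seq) {
    const std::size_t pos = seq.find_first_not_of("ACGT");
    return pos == std::string::npos ? -1 : static_cast<int>(pos % 3);
}

// Hypothetical helper: returns a copy of the sequence with every
// non-ACGT character removed (erase-remove idiom).
std::string strip_invalid(std::string seq) {
    seq.erase(std::remove_if(seq.begin(), seq.end(),
                             [](char c) {
                                 return c != 'A' && c != 'C' &&
                                        c != 'G' && c != 'T';
                             }),
              seq.end());
    return seq;
}
```

Keep in mind that removing a character shifts every following base by one position, which changes the reading frame, so stripping is only appropriate when the character really is junk.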