Regex protein digestion
So, I'm digesting a protein sequence with an enzyme (for your curiosity, Asp-N), which cleaves before the residues coded B or D in a single-letter coded sequence. My actual analysis uses String#scan for the captures. I'm trying to figure out why the following regular expression doesn't digest it correctly:
(\w*?)(?=[BD])|(.*\b)
where the second alternative, (.*\b), exists to capture the end of the sequence.
For:
MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN
this should give something like [MTM, DKPSQY, DKIEAELQ, DICN, DVLELL, DSKG, ... ] but instead misses each D in the sequence.
I've been using http://www.rubular.com for troubleshooting, which runs on 1.8.7, although I've also tested this regex on 1.9.2 to no avail. It is my understanding that zero-width lookahead assertions are supported in both versions of Ruby. What am I doing wrong with my regex?
Comments (2)
Basically, you want to cut your string before each B or D?
Gives you
The simplest way to support this is to split on the zero-width lookahead:
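A minimal sketch of that split, using the sequence from the question. The variable names are illustrative assumptions.

```ruby
# Sequence from the question (single-letter amino acid codes).
sequence = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

# Split at every position that is immediately followed by B or D.
# The lookahead consumes no characters, so each B/D stays at the
# start of the fragment that follows the cut.
peptides = sequence.split(/(?=[BD])/)

p peptides
# => ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]
```

Because the pattern matches a position rather than any characters, nothing is removed from the sequence: joining the fragments gives back the original string.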
To understand what was going wrong with your solution, let's first compare your regex with one that works:
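As a hedged illustration of such a comparison, here is the pattern from the question next to one that works with String#scan. The working pattern is an assumption of this sketch, chosen because it must consume at least one character per match. It is not necessarily the one originally posted.

```ruby
sequence = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

# Pattern from the question, for reference:
#   (\w*?)(?=[BD])|(.*\b)
# Here \w*? is allowed to match zero characters.

# Illustrative alternative: .+? must consume at least one character,
# then stops right before the next B or D, or at the end of the string.
working = /.+?(?=[BD]|\z)/

peptides = sequence.scan(working)
p peptides
# => ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]
```

The only essential difference is `+?` instead of `*?`: every match now has to take the leading B or D with it before the lookahead is tested again.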
The problem is that if you can capture zero characters and still match your zero-width lookahead, you succeed without advancing the scanning pointer. Let's look at a simpler-but-similar test case:
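One such test case might look like the sketch below. The string "abcd" and the character class [bd] are made up for illustration and stand in for the sequence and [BD].

```ruby
# \w*? is lazy and may match nothing at all, so whenever the scan
# pointer already sits on a "b" or "d" the pattern succeeds with an
# empty match instead of consuming that character.
result = "abcd".scan(/\w*?(?=[bd])/)

p result
# => ["a", "", "c", ""]
```

The "b" and "d" never appear in the output: each one is only ever seen by the lookahead, which does not consume it.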
A naive implementation of String#scan might get stuck in an infinite loop, repeatedly matching with the pointer before the first character. It appears that once a match occurs without advancing the pointer, the algorithm forcibly advances the pointer by one character. This explains the results in your case:
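The effect can be reproduced with the sketch below. As a simplifying assumption it uses only the first alternative of the pattern, without the capture groups, since that is the part where the D residues get lost.

```ruby
sequence = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

# First alternative of the question's pattern, without capture groups.
fragments = sequence.scan(/\w*?(?=[BD])/)

# Each "" is a zero-length match sitting right before a D. After it,
# scan steps over that D, so the next fragment starts one character late.
p fragments.first(6)
# => ["MTM", "", "KPSQY", "", "KIEAELQ", ""]

# Dropping the empty matches shows the fragments with every D missing.
p fragments.reject(&:empty?)
# => ["MTM", "KPSQY", "KIEAELQ", "ICN", "VLELL", "SKG", "YFRYLSEVASG"]
```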