Regular expression protein digestion

Posted on 2024-11-08 11:02:00


So, I'm digesting a protein sequence with an enzyme (for the curious, Asp-N) which cleaves before the residues coded by B or D in a single-letter coded sequence. My actual analysis uses String#scan for the captures. I'm trying to figure out why the following regular expression doesn't digest it correctly:

(\w*?)(?=[BD])|(.*\b)

where the second alternative, (.*\b), exists to capture the end of the sequence.
For:

MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN

This should give something like: [MTM, DKPSQY, DKIEAELQ, DICN, DVLELL, DSKG, ... ] but instead misses each D in the sequence.

I've been using http://www.rubular.com for troubleshooting, which runs on 1.8.7, although I've also tested this regex on 1.9.2 to no avail. It is my understanding that zero-width lookahead assertions are supported in both versions of Ruby. What am I doing wrong with my regex?


情话已封尘 2024-11-15 11:02:00

Basically, you want to cut your string before each B or D?

"...".split(/(?=[BD])/)

Gives you

["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]
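For a concrete run, here is a minimal sketch that applies the same split to the full sequence from the question. The helper name `asp_n_digest` is made up for illustration, and filling in the elided string with the question's sequence is an assumption.

```ruby
# Hypothetical helper (not from the original answer): cut a one-letter
# protein sequence before every B or D, as the question describes for Asp-N.
def asp_n_digest(sequence)
  # (?=[BD]) is a zero-width lookahead, so split consumes no characters
  # and each B or D stays at the start of its fragment.
  sequence.split(/(?=[BD])/)
end

seq = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
p asp_n_digest(seq)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"]
```

Because nothing is consumed by the split, joining the fragments gives back the original sequence.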


杀手六號 2024-11-15 11:02:00

The simplest way to support this is to split on the zero-width lookahead:

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p s.split(/(?=[BD])/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]

To understand what was going wrong with your solution, let's first compare a simplified version of your regex with one that works:

p s.scan(/.*?(?=[BD]|$)/)
#=> ["MTM", "", "KPSQY", "", "KIEAELQ", "", "ICN", "", "VLELL", "", "SKG", ""]

p s.scan(/.+?(?=[BD]|$)/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]

The problem is that if you can capture zero characters and still match your zero-width lookahead, you succeed without advancing the scanning pointer. Let's look at a simpler-but-similar test case:

s = "abcd"
p s.scan(//)      # Match any position, without advancing
#=> ["", "", "", "", ""]

p s.scan(/(?=.)/) # Anywhere that is followed by a character, without advancing
#=> ["", "", "", ""]

A naive implementation of String#scan might get stuck in an infinite loop, repeatedly matching with the pointer before the first character. It appears that once a match occurs without advancing the pointer the algorithm forcibly advances the pointer by one character. This explains the results in your case:

First it matches all the characters up to a B or D,
then it matches the zero-width position right before the B or D, without moving the character pointer;
as a result, the algorithm moves the pointer past the B or D and continues from there, which is why each D is missing from the output.
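The behaviour described above can be imitated with a small loop. To be clear, `naive_scan` is a hypothetical helper written for illustration, not Ruby's actual implementation; it only assumes that the pointer is pushed forward by one character after an empty match.

```ruby
# Hypothetical re-creation of the scanning loop (illustration only).
def naive_scan(str, re)
  results = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)   # search for the next match at or after pos
    break unless m
    results << m[0]
    if m.end(0) == m.begin(0)
      pos = m.end(0) + 1     # empty match: force the pointer one character ahead
    else
      pos = m.end(0)         # normal match: resume right after it
    end
  end
  results
end

p naive_scan("abcd", //)
#=> ["", "", "", "", ""]

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p naive_scan(s, /.*?(?=[BD]|$)/) == s.scan(/.*?(?=[BD]|$)/)
#=> true
```

The forced step after each empty match is what jumps over the D, so the next fragment starts one character too late.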
