Does a regular expression exist for enzymatic cleavage?
Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P.
Example:
Cleavage of the sequence VGTKCCTKPESERMPCTEDYLSLILNR
should result in these 3 sequences (peptides):
VGTK
CCTKPESER
MPCTEDYLSLILNR
Note that there is no cleavage after K in the second peptide (because P comes after K).
In Perl (it could just as well have been in C#, Python or Ruby):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
my @peptides = split /someRegularExpression/, $seq;
I have used this work-around (where a cut marker, =, is first inserted in the sequence and removed again if P comes immediately after the cut marker):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
$seq =~ s/([RK])/$1=/g; #Main cut rule.
$seq =~ s/=P/P/g; #The exception.
my @peptides = split( /=/, $seq);
But this requires modifying a string that can be very long, and there can be millions of sequences. Is there a way to use a regular expression directly with split? If so, what would the regular expression be?
Test platform: Windows XP 64 bit. ActivePerl 64 bit. From perl -v: v5.10.0 built for MSWin32-x64-multi-thread.
Comments (4)
You do indeed need to combine a positive lookbehind with a negative lookahead. In Perl syntax, that is a pattern of the form /(?<=[RK])(?!P)/ passed to split.
You could use look-around assertions to exclude those cases: a lookbehind that requires R or K just before the split point, and a negative lookahead that rejects a P just after it.
You can use lookaheads and lookbehinds to match this while still getting the correct split position.
The pattern should end up splitting at any point after an R or K that is not followed by a P.
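The look-around approach these answers describe can be sketched in Python, one of the languages the question mentions. This is only a minimal sketch, not the answerers' original code: the function name tryptic_digest is hypothetical, and the \Z alternative in the lookahead is an added assumption, needed because Python's re.split (3.7 and later) would otherwise return an empty trailing string when the sequence ends in R or K, whereas Perl's split discards trailing empty fields by default.

```python
import re

# Zero-width cut site: preceded by R or K, and not followed by P
# or by the end of the string (the \Z part is an added assumption).
CUT_SITE = re.compile(r"(?<=[RK])(?!P|\Z)")


def tryptic_digest(seq):
    """Split seq after every R or K that is not followed by P."""
    return CUT_SITE.split(seq)


peptides = tryptic_digest("VGTKCCTKPESERMPCTEDYLSLILNR")
print(peptides)  # ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

Because both assertions are zero-width, no residue is consumed by the split, so the input string does not have to be modified first.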
In Python you can use the finditer method to return non-overlapping pattern matches, including start and span information. You can then store the string offsets instead of rebuilding the string.
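A minimal sketch of that idea, assuming the look-around cut-site pattern described in the earlier answers; the helper name tryptic_spans is hypothetical. Each zero-width match marks a cut position, so only (start, end) offset pairs are stored and the sequence itself is never copied or modified.

```python
import re

# Zero-width cut site: preceded by R or K and not followed by P.
CUT_SITE = re.compile(r"(?<=[RK])(?!P)")


def tryptic_spans(seq):
    """Return (start, end) offsets of the tryptic peptides in seq."""
    spans = []
    start = 0
    for match in CUT_SITE.finditer(seq):
        cut = match.start()  # zero-width match, so start() == end()
        spans.append((start, cut))
        start = cut
    # Add the remaining tail unless the last cut was at the very end.
    if start < len(seq):
        spans.append((start, len(seq)))
    return spans


seq = "VGTKCCTKPESERMPCTEDYLSLILNR"
spans = tryptic_spans(seq)
print(spans)  # [(0, 4), (4, 13), (13, 27)]
print([seq[a:b] for a, b in spans])
```

Slicing with the stored offsets recovers a peptide only when it is actually needed, which may matter when there are millions of long sequences.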