Regex - Get the string between two words that doesn't contain a word
I've been looking around but couldn't make this work. I'm not a total beginner.
I need to get the text delimited by (and including) START and END that doesn't itself contain START. Basically, I can't find a way to negate a whole word without using advanced features.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try a negative lookahead followed by a non-greedy match, placed between START and END.
(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere after the current position.
.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).
Update:
I thought about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are excluding START from the content), then use the greedy version, .* instead of .*?, which will match up to the last "END".
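To make the lazy/greedy difference concrete, here is a minimal Python sketch. The answer's full pattern is not shown above, so the assembly START(?!.*START).*?END is an assumption, as are the names LAZY and GREEDY and the second sample string with an extra END.

```python
import re

# Assumed assembly of the described pieces: START, the negative lookahead,
# a lazy (or greedy) run of characters, then END.
LAZY = re.compile(r"START(?!.*START).*?END")
GREEDY = re.compile(r"START(?!.*START).*END")

sample = "abcSTARTabcSTARTabcENDabc"          # string from the question
two_ends = "abcSTARTabcSTARTabcENDabcENDabc"  # hypothetical: one more END

lazy_hit = LAZY.search(sample).group(0)
lazy_two = LAZY.search(two_ends).group(0)
greedy_two = GREEDY.search(two_ends).group(0)

print(lazy_hit)    # STARTabcEND
print(lazy_two)    # STARTabcEND
print(greedy_two)  # STARTabcENDabcEND
```

Note that the lookahead rejects any START that has another START anywhere later in the input, so a pattern of this shape only ever matches from the last START in the string; with several START...END pairs the earlier pairs are skipped.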
will work with any number of START...END pairs. To demonstrate in Python:
If you only care for the content between START and END, use this:
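As an illustration of matching several START...END pairs in Python, here is a minimal sketch. It assumes a pattern of the form START(?:(?!START).)*END, where each character is consumed only if START does not begin at that position. The sample string, the names PAIRS and INNER, and the capture-group variant for the inner content are all assumptions made for this example.

```python
import re

# Assumed pattern: after START, consume one character at a time, but only
# while the text at that position does not begin another START.
PAIRS = re.compile(r"START(?:(?!START).)*END")
# Variant with a capture group so that findall returns only the inner content.
INNER = re.compile(r"START((?:(?!START).)*)END")

# Hypothetical input: three proper pairs plus one START that is never closed.
text = "abcSTARTdefENDghiSTARTjklENDopqSTARTrstSTARTuvwENDxyz"

pairs = PAIRS.findall(text)
inner = INNER.findall(text)

print(pairs)  # ['STARTdefEND', 'STARTjklEND', 'STARTuvwEND']
print(inner)  # ['def', 'jkl', 'uvw']
```

The capture-group variant avoids lookbehind entirely, which may matter given the constraint stated in the question.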
The really pedestrian solution would be
START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END
Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backward search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To capture only the shortest possible string, extend the negation to also cover the end delimiter:
START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END
This risks exceeding the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc.). I fixed this by adding S to [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
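A quick way to check a lookaround-free pattern like this is to run it against the question's sample and against an edge case. The sketch below is in Python and uses the first pattern exactly as written above; the name PEDESTRIAN and the second input string are assumptions added for the test.

```python
import re

# Pattern copied from the answer: plain character classes, no lookaround.
PEDESTRIAN = re.compile(
    r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"
)

sample = "abcSTARTabcSTARTabcENDabc"  # string from the question
tricky = "STARTxSTASTARTyEND"         # hypothetical edge case

hit = PEDESTRIAN.search(sample).group(0)
edge = PEDESTRIAN.search(tricky).group(0)

print(hit)   # STARTabcEND
print(edge)  # STARTxSTASTARTyEND
```

The second result suggests the alternation still has a gap: STA[^R] can consume STAS, after which TART passes as ordinary characters, so a partial START prefix that runs directly into a real START is worth testing in your own engine before relying on the pattern.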
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that
START(?:(?!START).)*?END
is better, in order to catch only a START immediately followed by an END, without any START or END in between. I am using .NET, and Tim's solution would also match something like START END END. At least in my personal case this is not wanted.
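The difference shows up on the START END END input mentioned above. The answer refers to .NET; this Python sketch assumes the same behavior for these constructs, and the greedy form is inferred from the description (the same pattern without the ? after the *). The names GREEDY and LAZY are illustrative.

```python
import re

# Greedy form, inferred from the description of the earlier solution.
GREEDY = re.compile(r"START(?:(?!START).)*END")
# Lazy form suggested in this answer: stop at the first END.
LAZY = re.compile(r"START(?:(?!START).)*?END")

text = "START END END"

greedy_hit = GREEDY.search(text).group(0)
lazy_hit = LAZY.search(text).group(0)

print(greedy_hit)  # START END END
print(lazy_hit)    # START END
```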
[EDIT: I have left this post up for the information on capture groups, but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
would not work, as pointed out in the comments. I was forgetting that the ignored characters could not be dropped, and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END. As written it fails on something such as STARTSTAEND, so the lookahead approach is clearly a better choice. The following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is:
(?:START)((?!.*START).*)(?:END)
which captures the inner text as $1 for the replacement. If you want the START and END tags captured as well, you could use
(START)((?!.*START).*)(END)
which gives $1=START, $2=text and $3=END, or various other permutations obtained by adding or removing ()s or ?:s.
That way, if you are using it for search and replace, you can do something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when they are paired properly.
Each (x) is a group, but I have used (?:x) for each of them except the middle one, which marks them as non-capturing groups; the only one I left without a ?: was the middle one. However, you could also conceivably capture the START/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.
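The search-and-replace described above can be tried out as follows. This is a sketch in Python rather than Java, so the backreference is written \1 instead of $1; the pattern and the input string are taken from the answer, while the variable names are illustrative.

```python
import re

# Same pattern as in the answer; only the middle group is capturing.
PATTERN = re.compile(r"(?:START)((?!.*START).*)(?:END)")

text = "abcSTARTdefSTARTghiENDjkl"

inner = PATTERN.search(text).group(1)
# Python's re module writes the backreference as \1 (or \g<1>) rather than $1.
replaced = PATTERN.sub(r"BEGIN\1FINISH", text)

print(inner)     # ghi
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl
```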