Regex - Get the string between two words that doesn't contain a word

Posted on 2025-01-13 23:01:23

I've been looking around but could not make this work. I am not a total noob.

I need to get the text delimited by (and including) START and END that doesn't contain another START. Basically I can't find a way to negate a whole word without using advanced stuff.

Example string:

abcSTARTabcSTARTabcENDabc

The expected result:

STARTabcEND

Not good:

STARTabcSTARTabcEND

I can't use backward search stuff. I am testing my regex here: www.regextester.com

Thanks for any advice.

Comments (5)

花开柳相依 2025-01-20 23:01:23

Try this

START(?!.*START).*?END

See it here online on Regexr

(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere after the current position.

.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).

Update:

After thinking about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are only excluding START from the content), then use the greedy version:

START(?!.*START).*END

This will match up to the last "END".
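
A minimal Python sketch of how the two variants behave, assuming the standard re module. Apart from the string taken from the question, the sample strings are made up for illustration:

```python
import re

LAZY = re.compile(r"START(?!.*START).*?END")
GREEDY = re.compile(r"START(?!.*START).*END")

# The string from the question: the first START is rejected by the
# lookahead because another START follows it.
question = "abcSTARTabcSTARTabcENDabc"
lazy_hit = LAZY.search(question).group()

# With two ENDs after the last START, the two variants differ.
two_ends = "xSTARTaENDbENDy"
first_end = LAZY.search(two_ends).group()
last_end = GREEDY.search(two_ends).group()

# The lookahead rejects every START that has another START later on the
# same line, so only the last block is found.
pairs = "STARTaEND STARTbEND"
only_last = LAZY.findall(pairs)

print(lazy_hit)   # STARTabcEND
print(first_end)  # STARTaEND
print(last_end)   # STARTaENDbEND
print(only_last)  # ['STARTbEND']
```

The last case is worth keeping in mind: with several START...END pairs on one line, this pattern only returns the final one.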

一影成城 2025-01-20 23:01:23
START(?:(?!START).)*END

will work with any number of START...END pairs. To demonstrate in Python:

>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care about the content between START and END, use this:

(?<=START)(?:(?!START).)*(?=END)

See it here:

>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
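
The same idea can be wrapped in a small helper for arbitrary delimiters. This is only a sketch: the function name delimited_blocks, its parameters, and the use of re.DOTALL (so that blocks may span several lines) are assumptions for illustration, not part of the answer above.

```python
import re

def delimited_blocks(text, start="START", end="END"):
    """Return every start...end block whose body contains no further `start`."""
    s, e = re.escape(start), re.escape(end)
    # (?:(?!s).)* consumes one character at a time, but only while the
    # delimiter `start` does not begin at the current position.
    pattern = re.compile(f"{s}(?:(?!{s}).)*{e}", re.DOTALL)
    return pattern.findall(text)

sample = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
blocks = delimited_blocks(sample)
tags = delimited_blocks("<b>one<b>two</b>", start="<b>", end="</b>")
multiline = delimited_blocks("START\nfoo\nEND")

print(blocks)     # ['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
print(tags)       # ['<b>two</b>']
print(multiline)  # ['START\nfoo\nEND']
```

re.escape keeps delimiters that contain regex metacharacters from being interpreted as pattern syntax.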
冷弦 2025-01-20 23:01:23

The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.

Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To capture only the shortest possible string, extend the negation to also cover the end delimiter: START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.

Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and by adding S* before the non-optional S to allow for arbitrary repetitions of S.

寄离 2025-01-20 23:01:23

May I suggest a possible improvement on Tim Pietzcker's solution?
It seems to me that START(?:(?!START).)*?END is better, since it only catches a START followed by the nearest END, without any START or END in between. I am using .NET, and Tim's solution would also match something like START END END. At least in my case this is not wanted.
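
The difference can be sketched in Python (not .NET, which the answer above uses; the assumption here is only that both engines treat this pattern the same way on these simple inputs):

```python
import re

GREEDY = re.compile(r"START(?:(?!START).)*END")
LAZY = re.compile(r"START(?:(?!START).)*?END")

text = "START END END"
greedy_hits = GREEDY.findall(text)
lazy_hits = LAZY.findall(text)

# The lazy quantifier still cannot cross a second START.
nested_hits = LAZY.findall("STARTaSTARTbEND")

print(greedy_hits)  # ['START END END']
print(lazy_hits)    # ['START END']
print(nested_hits)  # ['STARTbEND']
```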

梦里兽 2025-01-20 23:01:23

[EDIT: I have left this post up for the information on capture groups, but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
would not work, as pointed out in the comments; I was forgetting that the ignored characters could not be dropped, so you would need something such as ...|STA(?![^R])| to still allow that character to be part of END. As written, it fails on something such as STARTSTAEND, so the lookahead is clearly the better choice; the following should show the proper way to use the capture groups...]

The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.

That way, if you are using it to do search and replace, you can do something like BEGIN$1FINISH. So, if you started with:

abcSTARTdefSTARTghiENDjkl

you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:

abcSTARTdefBEGINghiFINISHjkl

which would allow you to change your START/END tokens only when paired properly.

Each (x) is a group, but I have written every group except the middle one as (?:x), which marks it as non-capturing; the middle group is the only one left without ?:. You could also capture the START/END tokens if you wanted to move them around or what-have-you.
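
A rough Python equivalent of the replacement described above. Note the assumption: Python's re module writes the backreference in the replacement string as \1 rather than $1, so the replacement string differs from the Java form.

```python
import re

text = "abcSTARTdefSTARTghiENDjkl"

# Only the middle group captures; START and END are non-capturing.
inner_only = re.compile(r"(?:START)((?!.*START).*)(?:END)")
inner = inner_only.search(text).group(1)
replaced = inner_only.sub(r"BEGIN\1FINISH", text)

# All three groups capture, so the delimiters are available as well.
all_three = re.compile(r"(START)((?!.*START).*)(END)")
groups = all_three.search(text).groups()

print(inner)     # ghi
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl
print(groups)    # ('START', 'ghi', 'END')
```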

See the Java regex documentation for full details on Java regexes.
