How can I use a regular expression to capture all non-empty alphabetic sequences other than cat, dog, and fish?

Posted on 2024-09-25 02:35:23

Please explain why the expression still makes sense in complicated cases.

蛮可爱 2024-10-02 02:35:23

If you are actually using grep, you could use the -v option to select only the lines that don't match:

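grep -v '\(cat\|dog\|fish\|^$\)'

The pattern matches empty lines and lines containing "cat", "dog", or "fish"; with -v, grep prints every other line. The quotes keep the shell from stripping the backslashes before grep sees them.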
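For readers without grep at hand, the same filtering can be sketched in Python. This is only an illustration: the `keep_lines` helper and the sample lines are made up for the example, and the alternation is written in Python's syntax rather than grep's.

```python
import re

# Same alternatives as the grep pattern, in Python's regex syntax.
EXCLUDE = re.compile(r"cat|dog|fish|^$")

def keep_lines(lines):
    """Return the lines grep -v would print: those the pattern does not match."""
    return [line for line in lines if not EXCLUDE.search(line)]

sample = ["bird", "", "a cat sat", "hotdog", "horse"]
print(keep_lines(sample))  # ['bird', 'horse']
```

Note that "hotdog" is dropped too, because the pattern looks for the words anywhere in the line.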

Okay, you're not using grep. According to http://www.regular-expressions.info/refadv.html, if your regex engine supports it, you want a negative lookahead, written (?!regex):

`(?!regex)`
Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.
`t(?!s)` matches the first `t` in `streets`.
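The quoted example can be checked with a short Python sketch (Python's `re` module is just one engine that supports lookahead; the variable name is arbitrary):

```python
import re

# Negative lookahead: match "t" only where the next character is not "s".
matches = [(m.start(), m.group()) for m in re.finditer(r"t(?!s)", "streets")]
print(matches)  # [(1, 't')]
```

Only the `t` at index 1 is reported; the second `t` is followed by `s`, so the lookahead rejects it.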

计㈡愣 2024-10-02 02:35:23

Let's explore how we can build up a pattern which excludes specific phrases.

We'll start with a simple .*, which matches any character (using the dot), zero or more times (star). This pattern will match any string, including an empty string¹.

However, since there are specific phrases we don't want to match, we can try to use a negative lookaround to stop it from matching what we don't want. A lookaround is a zero-width assertion, which means that the regex engine needs to satisfy the assertion for there to be a match, but the assertion does not consume any characters (or in other words, it doesn't advance the position in the string). In this specific case, we will use a lookahead, which tells the regex engine to look ahead of the current position to match the assertion (there are also lookbehinds, which, naturally, look behind the current position). So we'll try (?!cat|dog|fish).*.

When we try this pattern against catdogfish, though, it matches atdogfish! What's going on here? Let's take a look at what happens when the engine tries to use our pattern on catdogfish.

The engine works from left to right, starting from before the first character in our string. On its first attempt, the lookahead asserts that the next characters from that point are not cat, dog, or fish, but since they actually are cat, the engine cannot match from this point and advances to before the second character. Here the assertion succeeds, because the characters that follow match none of the alternatives (atd is neither cat nor dog, and atdo is not fish). Now that the assertion has succeeded, the engine can match .*, and since quantifiers are greedy by default (which means that they consume as much of the string as possible), the dot-star consumes the rest of the string.
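The walkthrough above can be reproduced with a small Python sketch. Python's `re.search` is assumed here as a stand-in for "the engine"; other engines with lookahead support should behave the same way for this pattern.

```python
import re

# The lookahead is tested once, at the start of the match attempt.
m = re.search(r"(?!cat|dog|fish).*", "catdogfish")
print(m.start(), m.group())  # 1 atdogfish
```

The match starts at index 1, after the attempt at index 0 has been rejected by the lookahead.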

You might be wondering why the lookaround isn't checked again after the first assertion succeeds. That is because the lookahead sits in front of the dot-star rather than inside it: it is tested once, at the position where the match starts, and the dot-star then repeats on its own. Let's change that so that the lookaround asserts once per repetition: (?:(?!cat|dog|fish).)*.

The (?:…) is called a non-capturing group. In general, things in regular expressions are grouped by parentheses, but these parentheses are capturing, which means that the contents are saved into a backreference (or submatch). Since we don't need a submatch here, we can use a non-capturing group, which works the same as a normal group, but without the overhead of keeping track of a backreference.

When we run our new pattern against catdogfish, we now get three non-empty matches²: at, og and ish! (Because the star also allows zero repetitions, many engines additionally report empty matches at the positions where the assertion fails; they are ignored here.) Let's take a look at what's going on this time inside the regex engine.

Again the engine starts before the first character. It enters the group that will be repeated ((?!cat|dog|fish).) and sees that the assertion fails, so it moves on to the next position (a). The assertion succeeds, and the engine moves forwards to t. Again the assertion succeeds, and the engine moves forwards again. At this point the assertion fails (because the next three characters are dog), and the engine returns at as a match, because that is the longest string matching the pattern that starts at that position.

Next, even though we've already got a match, the engine will continue. It will move forwards to the next character (o), and again pick up two characters that match the pattern (og). Finally, the same thing will happen for the ish at the end of the string. Once the engine hits the end of the string, there is nothing more for it to do, and it returns the three matches it picked up.
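A Python sketch of this "global" run follows, assuming `re.finditer` as the global-mode equivalent. Since the star permits zero repetitions, Python also yields empty matches, which the sketch filters out so that only the pieces discussed above remain.

```python
import re

TEMPERED = re.compile(r"(?:(?!cat|dog|fish).)*")

# The star also allows zero repetitions, so empty matches are dropped here.
pieces = [m.group() for m in TEMPERED.finditer("catdogfish") if m.group()]
print(pieces)  # ['at', 'og', 'ish']
```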

So this pattern still isn't perfect, because it will match parts of a string that contains one of our disallowed phrases. In order to prevent this, we need to introduce anchors into our pattern: ^(?:(?!cat|dog|fish).)*$

Anchors are also zero-width assertions; they assert that the position the engine is in must be a specific location in the string. In our case, ^ matches the beginning of the string, and $ matches the end of the string. Now when we match our pattern against catdogfish, none of those small matches can be picked up anymore, because none of them spans the whole string from the start anchor to the end anchor.

So the final expression would be ^(?:(?!cat|dog|fish).)*$.
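As a quick check of the final expression, here is a minimal Python sketch. The `is_allowed` helper and the test strings are made up for illustration.

```python
import re

FINAL = re.compile(r"^(?:(?!cat|dog|fish).)*$")

def is_allowed(text):
    """True if no position in text starts one of the disallowed phrases."""
    return FINAL.search(text) is not None

for text in ["catdogfish", "hotdog", "horse", ""]:
    print(repr(text), is_allowed(text))
```

Only "horse" and the empty string pass. The empty string is accepted because the star allows zero repetitions.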

¹ However, the dot doesn't match newline characters by default, unless the /s (or "single line") modifier is enabled on the regex.
² I'm making the assumption here that the pattern is working in "global" mode, which makes the pattern match as many times as possible. Without global mode, the pattern would return only its first match.
