What is a non-capturing group in regular expressions?

半葬歌 2024-09-22 06:34:52

Let me try to explain this with an example.

Consider the following text:

http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex

Now, if I apply the regex below over it (the first version leaves the slashes unescaped for clarity; where / is the regex delimiter, as in a JavaScript regex literal, each slash must be escaped as \/)...

(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?      // slashes not escaped for clarity
(https?|ftp):\/\/([^/\r\n]+)(\/[^\r\n]*)?   // slashes escaped

... I would get the following result:

Match "http://stackoverflow.com/"
     Group 1: "http"
     Group 2: "stackoverflow.com"
     Group 3: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
     Group 1: "https"
     Group 2: "stackoverflow.com"
     Group 3: "/questions/tagged/regex"

But I don't care about the protocol -- I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).

(?:https?|ftp):\/\/([^/\r\n]+)(\/[^\r\n]*)?   // slashes escaped

Now, my result looks like this:

Match "http://stackoverflow.com/"
     Group 1: "stackoverflow.com"
     Group 2: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
     Group 1: "stackoverflow.com"
     Group 2: "/questions/tagged/regex"

See? The protocol group is no longer captured. The regex engine still uses it to match the text, but leaves it out of the numbered groups in the final result, so the host and path become groups 1 and 2.
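
To reproduce the two result listings above, here is a minimal sketch using Python's `re` module. Python is my choice for illustration, not something the answer prescribes, and the helper name `groups_for` is hypothetical. The slashes stay unescaped because a Python pattern string has no `/` delimiter.

```python
import re

TEXT = "http://stackoverflow.com/\nhttps://stackoverflow.com/questions/tagged/regex"

# Same patterns as in the answer; only the first group differs.
CAPTURING = r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?"
NON_CAPTURING = r"(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?"


def groups_for(pattern, text):
    """Return the tuple of captured groups for every match of pattern in text."""
    return [m.groups() for m in re.finditer(pattern, text)]


capturing = groups_for(CAPTURING, TEXT)
non_capturing = groups_for(NON_CAPTURING, TEXT)

# Print each match's groups with and without the protocol captured.
for full, short in zip(capturing, non_capturing):
    print(full, "->", short)
```

Both patterns match exactly the same text; the only difference is how many groups each match reports and how they are numbered.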

EDIT:

As requested, let me try to explain groups too.

Well, groups serve many purposes. They help you extract exact pieces of information from a bigger match (and they can be named), they let you rematch a previously matched group, and they can be used in substitutions. Let's try some examples, shall we?

Imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):

   \<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
   \<(.+?)\> [^<]*? \</\1\>

The first regex has a named group (TAG), while the second one uses an ordinary numbered group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one refers to that value by name, and the second one refers to it by group index (which starts at 1).
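
As a rough sketch of both backreference styles, here is the same idea in Python. Note that Python spells a named group `(?P<TAG>...)` and its backreference `(?P=TAG)`, which differs from the `(?<TAG>...)` and `\k<TAG>` syntax shown above. The sample input is made up for illustration, and the spaces and the backslashes before `<` and `>` are dropped because Python would treat the spaces as literal characters.

```python
import re

# Named group with a named backreference (Python syntax).
NAMED = re.compile(r"<(?P<TAG>.+?)>[^<]*?</(?P=TAG)>")
# Ordinary numbered group with a numeric backreference.
NUMBERED = re.compile(r"<(.+?)>[^<]*?</\1>")

# Made-up input: the second element closes with the wrong tag name.
sample = "<b>bold</b> <i>mismatched</b>"

named_tags = [m.group("TAG") for m in NAMED.finditer(sample)]
numbered_tags = [m.group(1) for m in NUMBERED.finditer(sample)]
print(named_tags, numbered_tags)
```

Only the `<b>` element is reported by either pattern, because the backreference forces the closing tag name to equal whatever the first group captured.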

Let's try some substitutions now. Consider the following text:

Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.

Now, let's use this dumb regex over it:

\b(\S)(\S)(\S)(\S*)\b

This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:

Match "Lorem"
     Group 1: "L"
     Group 2: "o"
     Group 3: "r"
     Group 4: "em"
Match "ipsum"
     Group 1: "i"
     Group 2: "p"
     Group 3: "s"
     Group 4: "um"
...

Match "consectetuer"
     Group 1: "c"
     Group 2: "o"
     Group 3: "n"
     Group 4: "sectetuer"
...

So, if we apply the substitution string:

$1_$3$2_$4

... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.

L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.
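
The substitution can be checked with a short Python sketch. One assumption to flag: Python's `re.sub` writes group references as `\1` (or `\g<1>`) rather than `$1`, so the replacement string is spelled differently from the one above even though it does the same thing.

```python
import re

text = "Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas."

# Same pattern as above; the replacement is the Python spelling of $1_$3$2_$4.
result = re.sub(r"\b(\S)(\S)(\S)(\S*)\b", r"\1_\3\2_\4", text)
print(result)
```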

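You can use named groups for substitutions too; the exact syntax depends on the engine (for example, ${name} in .NET, $<name> in JavaScript, and \g<name> in Python).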
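
Here is a small sketch of a named-group substitution in Python, where a named reference in the replacement string is written `\g<name>`. The group names `first`, `second`, `third`, and `rest` are hypothetical, picked only for this example.

```python
import re

# Same shape as the four-group pattern above, with hypothetical group names.
pattern = re.compile(r"\b(?P<first>\S)(?P<second>\S)(?P<third>\S)(?P<rest>\S*)\b")

# Swap the second and third characters, referring to the groups by name.
swapped = pattern.sub(r"\g<first>_\g<third>\g<second>_\g<rest>", "Lorem ipsum")
print(swapped)
```

Named references tend to survive edits to the pattern better than numeric ones, since adding a group does not renumber them.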

To play around with regexes, I recommend http://regex101.com/, which offers a good amount of detail on how a regex works; it also offers a few regex engines to choose from.
