How do I capture an arbitrary number of groups in a JavaScript regular expression?

Posted 2024-09-15 00:55:38 · 266 words · 14 views · 0 comments

I would expect this line of JavaScript:

"foo bar baz".match(/^(\s*\w+)+$/)

to return something like:

["foo bar baz", "foo", " bar", " baz"]

but instead it returns only the last captured match:

["foo bar baz", " baz"]

Is there a way to get all the captured matches?

Comments (6)

凉墨 2024-09-22 00:55:38

When you repeat a capturing group, in most flavors only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.

That is, in JavaScript, if you have a pattern with N capturing groups, you can capture at most N strings per match, even if some of those groups were repeated.

So generally speaking, depending on what you need to do:

  • If it's an option, split on delimiters instead
  • Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
    • Do note that these two aren't exactly equivalent, but it may be an option
  • Do multilevel matching:
    • Capture the repeated group in one match
    • Then run another regex to break that match apart
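Applied to the string from the question, the multilevel option might look like the sketch below. `captureAll` is a hypothetical helper name, not a built-in; it assumes the anchored pattern from the question is used only to validate the input, and the inner pattern with the g flag is used to pull out the pieces.

```javascript
// Sketch only: captureAll is a hypothetical helper, not part of any library.
function captureAll(input) {
  // Level 1: validate the whole string with the anchored, repeated pattern.
  if (!/^(\s*\w+)+$/.test(input)) {
    return null;
  }
  // Level 2: re-scan with the inner pattern and the g flag to collect every piece.
  const pieces = input.match(/\s*\w+/g);
  // Mimic the shape the question asked for: full match first, then each piece.
  return [input, ...pieces];
}

console.log(captureAll("foo bar baz"));
// [ 'foo bar baz', 'foo', ' bar', ' baz' ]
```

Be aware that a nested quantifier such as (\s*\w+)+ can backtrack heavily on long inputs that do not match, so for untrusted input a simpler validation pattern may be preferable.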

Example

Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):

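var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";

// Group 1 captures the whole word list; group 2 is repeated and keeps only its last capture.
var r = /<(\w+(;\w+)*)>/g;

var match;
while ((match = r.exec(text)) !== null) {
  // console.log works in browsers and Node.js; some online shells provide print() instead.
  console.log(match[1].split(";").join(","));
}
// c,d,e,f
// xx,yy,zz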

The pattern used is:

      _2__
     /    \
<(\w+(;\w+)*)>
 \__________/
      1

This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
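To see what each group holds for a single match, here is a small sketch using the same pattern without the g flag. The input string is a shortened, made-up variant of the one above.

```javascript
// Same pattern as above, run once with exec (no g flag).
const match = /<(\w+(;\w+)*)>/.exec("a;<c;d;e;f>;g");

console.log(match[0]); // <c;d;e;f>  (the full match)
console.log(match[1]); // c;d;e;f    (group 1: the whole word list)
console.log(match[2]); // ;f         (group 2: only the last repetition survives)

// Splitting group 1 recovers the individual words.
console.log(match[1].split(";")); // [ 'c', 'd', 'e', 'f' ]
```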

偷得浮生 2024-09-22 00:55:38

How about this? "foo bar baz".match(/(\w+)+/g)

偏爱你一生 2024-09-22 00:55:38

Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:

var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);

大姐,你呐 2024-09-22 00:55:38

Try using the 'g' flag:

"foo bar baz".match(/\w+/g)

爱情眠于流年 2024-09-22 00:55:38

I've read through this question and all its answers, and I feel they leave a great deal of ambiguity. So, in the interest of clearing things up:

1. String.prototype.match

I would expect this line of JavaScript:

"foo bar baz".match(/^(\s*\w+)+$/) to return something like:
["foo bar baz", "foo", " bar", " baz"]

In order to get the desired output, you need to explicitly capture all three groups (particularly because you're bracketing the pattern with ^...$, indicating you want the whole string to pass the test in ONE match). String.prototype.match returns that format ([FULL_MATCH, CAP1, CAP2, CAP3, ... CAPn]) when an expression matches the target string without the /global flag set. This is its "exec" response, so called because in that case String.prototype.match returns the same result as RegExp.prototype.exec. Consider:

// EX.1
"foo bar baz".match(/^(\w+) (\w+) (\w+)$/)

...which yields...

// RES.1
['foo bar baz', 'foo', 'bar', 'baz']

Why? Because you're declaratively capturing 3 distinct groups.
Compare this to:

// EX.2 (Note the /global flag and absence of the `^` and `$`)
"foo bar baz".match(/(\w+)/g)

...which yields...

// RES.2
['foo', 'bar', 'baz']

This is because the .match method serves double duty. In a non-/global match, it returns the full match as the first element of the returned array, and each capture group as an additional element of that same array, exactly as RegExp.prototype.exec does.

The format of the results in RES.2 is therefore the consequence of the /global flag, which tells the engine: "each time you find a match of this pattern, record it, and resume from the end of that match to see if there are more". RegExp.prototype.exec reports one occurrence at a time, together with its capture groups; with the /global flag, match instead collects the full text of every occurrence and drops the capture groups.
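The "resume from the end of that match" behaviour can be observed directly by calling exec in a loop on a /global regex. The sketch below records, for each occurrence, the match, its capture group, where it started, and where the next search will resume (lastIndex); the variable names are only illustrative.

```javascript
const re = /(\w+)/g;
const text = "foo bar baz";
const seen = [];

let m;
// With the g flag, each exec call starts searching at re.lastIndex.
while ((m = re.exec(text)) !== null) {
  seen.push({ match: m[0], group1: m[1], index: m.index, next: re.lastIndex });
}

console.log(seen);
// [
//   { match: 'foo', group1: 'foo', index: 0, next: 3 },
//   { match: 'bar', group1: 'bar', index: 4, next: 7 },
//   { match: 'baz', group1: 'baz', index: 8, next: 11 }
// ]
```

Unlike match with /global, each iteration here still has access to the capture groups of that occurrence.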

2. String.prototype.matchAll

If you DO need the full "exec match" format and want the captures for ALL n matches, you need matchAll. It requires the /global flag (calling it with a non-global regular expression throws a TypeError) and returns a RegExpStringIterator, an iterator that yields one exec-style match array per occurrence. Let's re-run the query from EX.2 with matchAll:

// EX.3a
"foo bar baz".matchAll(/(\w+)/g)
// RES.3a
RegExpStringIterator

Because it hands back an iterator, to actually get our grubby mitts on the data, we'll use the spread syntax (...<ITERATOR>). Since we then need something to spread it INTO, we'll wrap the whole lot in an array literal ([...<ITERATOR>]). We get:

// EX.3b
[..."foo bar baz".matchAll(/(\w+)/g)]
// RES.3b
[
   ['foo', 'foo'], ['bar', 'bar'], ['baz', 'baz']
]

As you can see, 3 matches, each an array of [<MATCH>, <CAPGROUP>].
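If you don't need the whole array at once, the iterator can also be consumed with for...of. The following sketch collects each capture together with its position, and also shows what happens when the g flag is left off; the variable names are illustrative.

```javascript
const words = [];
// Each item yielded by matchAll is an exec-style match array with an index property.
for (const m of "foo bar baz".matchAll(/(\w+)/g)) {
  words.push(m.index + ":" + m[1]);
}
console.log(words); // [ '0:foo', '4:bar', '8:baz' ]

// Without the g flag, matchAll throws instead of returning an iterator.
let threw = false;
try {
  "foo bar baz".matchAll(/(\w+)/);
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```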

Really, this all boils down to the fact that match is returning, well, matches, NOT captures. If a match happens to BE a capture (or contain multiples) it's helpful enough to break those out for you when you wrap them in parens. Indeed, "foo bar baz".match(/\w+/g) (note presence of /global and absence of capture parens) will still yield ['foo', 'bar', 'baz']. It found 3 matches, you didn't specify you wanted groups, so it exec'd its way into finding them.

ALL of which, I believe, is in large part due to a huge misconception about how RegExp returns results. Namely,

3. A GROUP is not a MATCH

Part of the ambiguity here is the terminology. One can have MULTIPLE capture groups contained within the same match. One cannot have multiple matches in one group. Think of a Venn diagram:
[Figure: Venn diagram showing 3 groups contained within a single match]

This may be easier to visualize using the syntax. Say we used the regex:

// EX.4
"foo bar baz".match(/(?<GROUP1>\w+) (?<GROUP2>\w+) (?<GROUP3>\w+)/)

The only thing I've changed from EX.1, apart from dropping the ^ and $ anchors, is that I've assigned a name (the ?<NAME> syntax) to each capture group (the portion of the match defined by the pattern contained within the parentheses, (...)). Because they're named, the response array's .groups property is populated:

// RES.4
[
    "foo bar baz",
    "foo",
    "bar",
    "baz"
],
    groups: {
        "GROUP1": "foo",
        "GROUP2": "bar",
        "GROUP3": "baz"
    }

Because we explicitly named all three capture GROUPS, we can see we have a single MATCH (the array containing the full match and all 3 capture groups' contents), along with our object of named captures.

So, finally, let's try your initial attempt with the extra info:

// EX.5
"foo bar baz".match(/^(?<GROUPn>\s*\w+)+$/)
// RES.5
[
    "foo bar baz",
    " baz"
],
    groups: {
        "GROUPn": " baz"
    }

Wait, what gives?

Because you've only declaratively specified a single capture group (which I've helpfully labelled "GROUPn"), you've provided only one "slot" for the capture to land in.

In short: it's not that your expression isn't matching all three elements... it's that the "slot", the variable used to store the captured value on its way to you in the response, is overwritten twice. All of which is to say: one cannot store multiple captures in one capture group (at least in ECMAScript's RegExp).

You can certainly store multiple matches (and if that's all you need, hey, great) but there are times when one cannot iterate the result set before applying it.

Take, as a final example:

// EX.6a
console.log("foo bar baz".replace(/(...) (...) (...)/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))
// RES.6a
The first word is 'foo'.
The second word is 'bar'.
The third word is 'baz'.

In this instance, we NEED ALL THREE captures in the SAME match, so we can directly reference each in the same replace operation. If we tried this with your expression:

// EX.6b
console.log("foo bar baz".replace(/^(\s*\w+)+$/, "The first word is '$1'.\nThe second word is '$2'.\nThe third word is '$3'."))

...we end up with the confusingly-inaccurate...

// RES.6b
The first word is ' baz'.
The second word is '$2'.
The third word is '$3'.
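When iterating over the matches IS an option, one way around the single-slot limit is to collect each word as its own match and build the text from that list. This is only a sketch; the ordinals array and the fallback label are assumptions added for illustration.

```javascript
// Illustrative labels only; they are not part of the original example.
const ordinals = ["first", "second", "third"];

// Collect every word as its own match instead of as a repeated capture.
const words = "foo bar baz".match(/\w+/g) || [];

const lines = words.map((word, i) => {
  // Fall back to a numeric label if there are more words than ordinals.
  const label = ordinals[i] || "#" + (i + 1);
  return "The " + label + " word is '" + word + "'.";
});

console.log(lines.join("\n"));
// The first word is 'foo'.
// The second word is 'bar'.
// The third word is 'baz'.
```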

Hope this helps someone in the future.

疏忽 2024-09-22 00:55:38

You can use LAZY evaluation.
So, instead of using * (GREEDY), try using ? (LAZY)

REGEX: (\s*\w+)?

RESULT:

Match 1: foo

Match 2: bar

Match 3: baz
