How to capture an arbitrary number of groups in a JavaScript regular expression?
I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?
When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.
That is, in JavaScript, if you have a pattern with N capturing groups, you can only capture N strings per match, even if some of those groups were repeated.
So generally speaking, depending on what you need to do: rather than matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop.
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com).
The pattern matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
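The exec loop described in this example might look like the following. This is only a minimal sketch: the sample text, the helper name extractWordLists, and the exact pattern /<(\w+(;\w+)*)>/g are assumptions chosen to fit the description (group 1 holds the whole list, group 2 is the repeated part).

```javascript
// Hypothetical helper: collect the word list inside each <...> block.
function extractWordLists(text) {
  // Group 1 = the whole semicolon-separated list.
  // Group 2 = the repeated ";word" part (it only keeps its last repetition).
  const pattern = /<(\w+(;\w+)*)>/g;
  const lists = [];
  let match;
  // With the g flag, each exec call resumes where the previous match ended.
  while ((match = pattern.exec(text)) !== null) {
    lists.push(match[1].split(";"));
  }
  return lists;
}

// Sample input made up for illustration.
const sampleText = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
console.log(extractWordLists(sampleText));
// [ [ 'c', 'd', 'e', 'f' ], [ 'xx', 'yy', 'zz' ] ]
```

The regex only has to find each bracketed list; the actual splitting is left to split, which has no limit on the number of pieces.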
How about this?
"foo bar baz".match(/(\w+)+/g)
Unless you have a more complicated requirement for how you're splitting your strings, you can split the string and then put the original string in front of the pieces:
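A minimal sketch of that idea, assuming the words are separated by single spaces; the helper name splitWithOriginal is made up for illustration:

```javascript
// Hypothetical helper: the original string followed by its space-separated pieces.
function splitWithOriginal(data) {
  const pieces = data.split(" ");
  pieces.unshift(data); // put the full string first, the way match() would
  return pieces;
}

console.log(splitWithOriginal("foo bar baz"));
// [ 'foo bar baz', 'foo', 'bar', 'baz' ]
```

Unlike the output asked for in the question, the pieces here carry no leading spaces, because the delimiter is consumed by split.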
Try using the 'g' (global) flag:
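For instance (a sketch, assuming the simple patterns below are what is meant), dropping the anchors and the repetition and adding the g flag returns every piece as its own match:

```javascript
// Each word is a separate match; no capture group is needed.
const plainWords = "foo bar baz".match(/\w+/g);
console.log(plainWords); // [ 'foo', 'bar', 'baz' ]

// Keeping the optional leading whitespace from the question's pattern.
const spacedWords = "foo bar baz".match(/\s*\w+/g);
console.log(spacedWords); // [ 'foo', ' bar', ' baz' ]
```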
I've read through this question and all its answers, and I feel they leave a great deal of ambiguity. So, in the interest of clearing things up:
1. String.prototype.match
To get the desired output, you need to explicitly capture all three groups (particularly because you're bracketing the pattern with ^...$, indicating you want the whole string to pass the test in ONE match). String.prototype.match returns that format ([FULL_MATCH, CAP1, CAP2, CAP3, ... CAPn]), the "exec" response (so called because String.prototype.match delegates to RegExp.prototype.exec in this case), when an expression matches the target string without the /g (global) flag set. Consider Ex.1, a pattern with three explicit groups such as /^(\w+) (\w+) (\w+)$/, which yields the full match followed by 'foo', 'bar', and 'baz' (Res.1).
Why? Because you're declaratively capturing 3 distinct groups.
Compare this to Ex.2, a single group with the /g flag such as /(\w+)/g, which yields only ['foo', 'bar', 'baz'] (Res.2).
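The two cases can be compared side by side. This is a minimal sketch; the patterns /^(\w+) (\w+) (\w+)$/ and /(\w+)/g are assumed stand-ins for Ex.1 and Ex.2, not necessarily the exact expressions used originally.

```javascript
const input = "foo bar baz";

// Ex.1 (assumed pattern): three explicit groups, no g flag.
// match() returns the exec-style array: full match, then one entry per group.
const res1 = input.match(/^(\w+) (\w+) (\w+)$/);
console.log(Array.from(res1)); // [ 'foo bar baz', 'foo', 'bar', 'baz' ]
console.log(res1.index);       // 0

// Ex.2 (assumed pattern): one group with the g flag.
// match() returns only the full matches; capture groups are not reported.
const res2 = input.match(/(\w+)/g);
console.log(res2); // [ 'foo', 'bar', 'baz' ]
```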
This is because the .match method serves double duty. In a non-global match (no /g flag), it returns the match as the first element of the returned array and each capture group as an additional element of that same array, because in that mode match() simply returns what RegExp.prototype.exec returns.
The format of the results in Res.2, therefore, is the consequence of the /g (global) flag, which tells the engine: "each time you find a match of this pattern, return it, and resume from the end of that match to see if there are more". Since you've provided the "go ahead and continue until we run out" flag, you get a flat collection of every full match, and the per-match capture groups are dropped.
2. String.prototype.matchAll
If you DID need the full "exec match" format and wanted the captures for ALL n matches, and you're willing to use the /g (global) flag (and you'll HAVE to for this to work), you need matchAll. It returns a RegExpStringIterator, a specialized iterator, and it throws a TypeError if the regex lacks the /g flag. Let's re-run the same query from Ex.2 with matchAll.
Because it hands back an iterator, to actually get our grubby mitts on the data, we'll use the spread syntax (...<ITERATOR>). Since we then need something to spread it INTO, we'll wrap the whole lot in an array literal ([...<ITERATOR>]). We get 3 matches, each an array of [<MATCH>, <CAPGROUP>].
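A short sketch of the matchAll call, again assuming /(\w+)/g as the Ex.2 pattern:

```javascript
// matchAll returns an iterator; spreading it into an array literal
// gives one exec-style result per match.
const allMatches = [..."foo bar baz".matchAll(/(\w+)/g)];

for (const m of allMatches) {
  // m[0] is the full match, m[1] is capture group 1, m.index is the position.
  console.log(m[0], m[1], m.index);
}
// foo foo 0
// bar bar 4
// baz baz 8
```

Each entry keeps its own captures, which is exactly what the global form of match throws away.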
Really, this all boils down to the fact that match is returning, well, matches, NOT captures. If a match happens to BE a capture (or to contain several), it's helpful enough to break those out for you when you wrap them in parens. Indeed, "foo bar baz".match(/\w+/g) (note the presence of the /g flag and the absence of capture parens) will still yield ['foo', 'bar', 'baz']. It found 3 matches; you didn't specify that you wanted groups, so it exec'd its way into finding them.
Much of the confusion here, I believe, comes from a huge misconception about how RegExp returns results. Namely:
3. A GROUP is not a MATCH
Part of the ambiguity here is the terminology. One can have MULTIPLE capture groups contained within the same match. One cannot have multiple matches in one group. Think of a Venn diagram: the groups sit inside the match, never the other way around.
This may be easier to visualize using the syntax. Say we take the regex from Ex.1 and change only one thing: assign a name (the ?<NAME> syntax) to each capture group (the portion of the match defined by the pattern contained within the parentheses, (...)). Because the groups are named, the .groups property of our response array is populated. Because we explicitly named all three capture GROUPS, we can see we have a single MATCH (the array containing the full match and all 3 capture groups' contents), along with our object of named captures.
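As a sketch of the named version (the group names first, second, and third are hypothetical, as is the exact pattern):

```javascript
// Same three groups as before, each given a name via (?<NAME>...).
const named = "foo bar baz".match(/^(?<first>\w+) (?<second>\w+) (?<third>\w+)$/);

// One MATCH: the full match plus the three captures, in order.
console.log(Array.from(named)); // [ 'foo bar baz', 'foo', 'bar', 'baz' ]

// The same captures, addressable by name.
console.log(named.groups.first, named.groups.second, named.groups.third);
// foo bar baz
```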
So, finally, let's try your initial attempt, /^(\s*\w+)+$/, with the group named to get the extra info. The result contains the full match and a single capture, ' baz'.
Wait, what gives?
Because you've only declaratively specified a single capture group (which I've helpfully labelled "GROUPn"), you've provided only one "slot" for the capture to land in.
In short: it's not that your expression isn't matching all three elements... it's that the "slot", the variable used to store that captured value on its way to you in the response, is being overwritten twice. All of which is to say: one cannot store multiple captures in one capture group (at least in ECMAScript's RegExp engine).
You can certainly store multiple matches (and if that's all you need, hey, great), but there are times when one cannot iterate the result set before applying it.
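The overwritten "slot" is easy to see by running the question's pattern with its single group named GROUPn (a small sketch):

```javascript
// One group, repeated three times while matching: only one slot to fill.
const attempt = "foo bar baz".match(/^(?<GROUPn>\s*\w+)+$/);

console.log(Array.from(attempt));   // [ 'foo bar baz', ' baz' ]
console.log(attempt.groups.GROUPn); // ' baz' (the last repetition wins)
```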
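Take, as a final example, a replace operation whose replacement refers to each capture. In this instance, we NEED ALL THREE captures in the SAME match, so we can directly reference each in the same replace operation. If we tried this with your expression, we would end up with a confusingly inaccurate result, because only the last capture is available to the replacement.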
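A sketch of such a replace; the replacement strings below are made up for illustration:

```javascript
// Three groups in ONE match: each capture can be referenced in the replacement.
const reordered = "foo bar baz".replace(/^(\w+) (\w+) (\w+)$/, "$3 $2 $1");
console.log(reordered); // baz bar foo

// The question's pattern has a single group, so $1 is all there is,
// and it holds only the last repetition.
const confusing = "foo bar baz".replace(/^(\s*\w+)+$/, "[$1]");
console.log(confusing); // [ baz]
```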
Hope this helps someone in the future.
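You can make the group optional instead of repeating it, and match globally.
So, instead of repeating the group with + (which keeps only the last capture), try ? after the group. Note that ? in this position makes the group optional; it is not a lazy quantifier (a lazy quantifier would be written +? or *?).
REGEX: (\s*\w+)?
RESULT (with the global flag):
Match 1: foo
Match 2: bar
Match 3: baz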
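In JavaScript this might be sketched as follows, assuming the pattern is applied with the g flag the way a regex tester would apply it. Because the group is optional, the pattern can also match the empty string at the end of the input, so the sketch filters that out and trims the leading whitespace:

```javascript
// The optional group matches once per word, plus one empty match at the end.
const rawMatches = "foo bar baz".match(/(\s*\w+)?/g);
console.log(rawMatches); // [ 'foo', ' bar', ' baz', '' ]

// Drop the empty match and strip the leading spaces.
const cleaned = rawMatches.filter(s => s !== "").map(s => s.trim());
console.log(cleaned); // [ 'foo', 'bar', 'baz' ]
```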