What is the difference between .*? and .* regular expressions?

Posted on 2024-09-06 02:50:19

I'm trying to split up a string into two parts using regex. The string is formatted as follows:

text to extract<number>

I've been using (.*?)< and <(.*?)>, which work fine, but after reading into regex a little, I've started to wonder why I need the ? in the expressions. I only wrote them that way after finding them through this site, so I'm not exactly sure what the difference is.

婴鹅 2024-09-13 02:50:19

On greedy vs non-greedy

Repetition quantifiers in regex are greedy by default: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition will have matched as many reps as possible.

A ? placed after a repetition quantifier changes this behavior to non-greedy, also called reluctant (in e.g. Java) and sometimes "lazy". In contrast, such a repetition first tries to match as few reps as possible, and when this doesn't work and it has to backtrack, it matches one more rep at a time. As a result, when a match finally happens, a reluctant repetition will have matched as few reps as possible.
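As a rough illustration of the difference, here is a minimal sketch using Python's re module. The input string is made up for the example (it is not from the question); it uses the same text<number> shape but with two bracketed parts, so the two quantifiers give visibly different results.

```python
import re

# Hypothetical input in the "text<number>" shape, with two bracketed parts
s = "first<1> second<2>"

# Greedy: .* runs to the end of the string, then backs off to the LAST '>'
greedy = re.search(r"<(.*)>", s).group(1)

# Reluctant: .*? starts empty and grows only until the FIRST '>' is reachable
lazy = re.search(r"<(.*?)>", s).group(1)

print(greedy)  # 1> second<2
print(lazy)    # 1
```

With only one <...> part in the input, as in the question, both forms would capture the same text; the ? only matters once there is more than one place where the rest of the pattern could match.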

References

regular-expressions.info/Repetition - Laziness instead of Greediness

Example 1: From A to Z

Let's compare these two patterns: A.*Z and A.*?Z.

Given the following input:

eeeAiiZuuuuAoooZeeee

The patterns yield the following matches:

A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
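The same results can be reproduced locally; the following is a small sketch in Python (the choice of Python and of re.findall is mine, the answer itself links to rubular.com):

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

# findall returns every non-overlapping match, scanning left to right
greedy_matches = re.findall(r"A.*Z", text)
reluctant_matches = re.findall(r"A.*?Z", text)

print(greedy_matches)     # ['AiiZuuuuAoooZ']
print(reluctant_matches)  # ['AiiZ', 'AoooZ']
```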

Let's first focus on what A.*Z does. Once it has matched the first A, the .*, being greedy, first tries to match as many . as possible.

eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match

Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:

eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match

This happens a few more times, until finally we come to this:

eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match

Now Z can match, so the overall pattern matches:

eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched

By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.
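The two search orders walked through above can be mimicked with a toy simulation. This is only a sketch of the idea, not how a real regex engine is implemented: the function name match_a_dot_star_z is hypothetical, it handles only the pattern A.*Z or A.*?Z anchored at a given position, and it assumes the text contains no newlines (so that . can match every character).

```python
def match_a_dot_star_z(text, start, greedy):
    """Toy simulation of A.*Z (greedy=True) or A.*?Z (greedy=False) at `start`.

    Returns (matched_text_or_None, attempts), where attempts counts how many
    different lengths were tried for the .* part.
    """
    if not text.startswith("A", start):
        return None, 0
    max_len = len(text) - (start + 1)  # the most characters .* could consume
    # Greedy tries the longest length first; reluctant tries the shortest first
    lengths = range(max_len, -1, -1) if greedy else range(0, max_len + 1)
    attempts = 0
    for n in lengths:
        attempts += 1
        z_pos = start + 1 + n  # where Z would have to be
        if z_pos < len(text) and text[z_pos] == "Z":
            return text[start:z_pos + 1], attempts
    return None, attempts


text = "eeeAiiZuuuuAoooZeeee"
greedy_result = match_a_dot_star_z(text, 3, greedy=True)
reluctant_result = match_a_dot_star_z(text, 3, greedy=False)

print(greedy_result)     # ('AiiZuuuuAoooZ', 6)
print(reluctant_result)  # ('AiiZ', 3)
```

The greedy run gives up one character at a time from the end of the input until Z fits, while the reluctant run grows from zero characters and stops at the first Z.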

Here's a visual representation of what the two patterns matched:

eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy

Example: An alternative

In many applications, the two matches in the above input are what is desired, so a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative: a negated character class.

The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.

The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use a greedy or a reluctant quantifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
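A short sketch in Python showing that, for this input, the negated character class gives the same two matches whether its quantifier is greedy or reluctant. Possessive quantifiers are left out here because their availability depends on the regex flavor and version.

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

# [^Z]* can never run past a Z, so there is only one way for it to match
negated_greedy = re.findall(r"A[^Z]*Z", text)
negated_reluctant = re.findall(r"A[^Z]*?Z", text)

print(negated_greedy)     # ['AiiZ', 'AoooZ']
print(negated_reluctant)  # ['AiiZ', 'AoooZ']
```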

References

regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers

Example 2: From A to ZZ

This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.

eeAiiZooAuuZZeeeZZfff

These are the matches for the above input:

A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
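For anyone who wants to check these three results without ideone.com, here is a minimal Python sketch (the dictionary keys are just labels chosen for this example):

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

patterns = {
    "negated": r"A[^Z]*ZZ",
    "reluctant": r"A.*?ZZ",
    "greedy": r"A.*ZZ",
}

# Collect every match of each pattern against the same input
results = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, matches in results.items():
    print(name, matches)
# negated ['AuuZZ']
# reluctant ['AiiZooAuuZZ']
# greedy ['AiiZooAuuZZeeeZZ']
```

Note that the negated character class fails at the first A, because [^Z]* cannot step over the single Z in AiiZoo, and only succeeds at the second A.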

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g

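Related topics

These are links to questions and answers on stackoverflow that cover some topics that may be of interest.

One greedy repetition can outgreed another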
