What is the difference between the regular expression modifiers (or flags) 'm' and 's'?

Posted 2024-07-22 07:50:26 · 418 words · 5 views · 0 comments

I often forget about the regular expression modifiers m and s and their differences. What is a good way to remember them?

As I understand them, they are:

'm' is for multiline, so that ^ and $
will match beginning of string and end
of string multiple times. (as divided
by \n)

's' is so that the dot will match even
the newline character

Often, I just use

/some_pattern/ism

But it probably is better to use them accordingly (usually "s" in my cases).

What do you think can be a good way to remember them, instead of forgetting which is which every time?


Comments (3)

葬心 2024-07-29 07:50:26

It's not uncommon to find someone who's been using regexes for years who still doesn't understand how those two modifiers work. As you observed, the names "multiline" and "singleline" are not very helpful. They sound like they must be mutually exclusive, but they're completely independent. I suggest you ignore the names and concentrate on what they do: m changes the behavior of the anchors (^ and $), and s changes the behavior of the dot (.).
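As a rough illustration of that independence, here is a minimal JavaScript sketch. The sample string and variable names are made up for the example, and it assumes an engine that supports the s flag (ES2018 or later).

```javascript
// Hypothetical sample text: three lines separated by "\n".
const text = "one\ntwo\nthree";

// m changes only the anchors: ^ now matches at the start of every line.
const lineStartsPlain = text.match(/^\w+/g);  // only the first line
const lineStartsM = text.match(/^\w+/gm);     // one hit per line

// s changes only the dot: . now matches the newline as well.
const dotPlain = text.match(/one.two/);       // null, . stops at "\n"
const dotS = text.match(/one.two/s);          // matches "one\ntwo"

// Each flag leaves the other feature untouched.
const dotWithM = text.match(/one.two/m);      // still null
const anchorsWithS = text.match(/^\w+/gs);    // still only the first line

console.log(lineStartsPlain, lineStartsM, dotPlain, dotS && dotS[0], dotWithM, anchorsWithS);
```

Reading a regex then becomes a matter of asking two separate questions: does it use ^ or $ (look at m), and does it use . (look at s).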

One prominent person who mixed up the modes is the author of Ruby. He created his own regex implementation based on Perl's, except he decided to have ^ and $ always be line anchors--that is, multiline mode is always on. Unfortunately, he also incorrectly named the dot-matches-everything mode multiline. So Ruby has no s modifier, but its m modifier does what s does in other flavors.

As for always using /ism, I recommend against it. It's mostly harmless, as you've discovered, but it sends a confusing message to anyone else who's trying to figure out what the regex was supposed to do (or even to yourself, in the future).

羁〃客ぐ 2024-07-29 07:50:26

I like the explanation in 'man perlre':

m Treat string as multiple lines.
s Treat string as single line.

With multiple lines, ^ and $ apply to individual lines (i.e. just before and after newlines).
With a single line, ^ and $ apply to the whole, and \n just becomes another character you can match.

[Wrong]By using both m and s as you described, I would expect the second one to take precedence, so you would always be in multiline mode with /ism.[/Wrong]

I didn't read far enough:
The "/s" and "/m" modifiers both override the $* setting. That is, no matter what $* contains, "/s" without "/m" will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

情归归情 2024-07-29 07:50:26

Update 2020:

I can write more clearly what they are, and a way to remember them, and I am writing it as related to JavaScript:

  1. Traditionally, JS regex had no s flag. It only had the m flag. As of January 2020, Firefox still doesn't have it, while Chrome and NodeJS do. It is in the ES2018 spec.
  2. The s flag is also called dotall or singleline. All it does is let . match any character, including \n, \r, \u2028 (line separator), and \u2029 (paragraph separator). If someone asks what . matches and you answer "any character", that is not entirely correct: by default it matches every character except \n, \r, and the Unicode line separator and paragraph separator. For it to match truly every character, the s flag must be on.
  3. To work around a missing s flag in Firefox or on any other platform, you can use [^], [\s\S], [\d\D], etc., or (.|\s).
  4. That's all. That's the s flag that is missing in traditional JavaScript.
  5. Now the m flag. It stands for multiline, and it really is very simple: without the m flag, ^ and $ match only the beginning and end of the whole string. So "John Doe\nMary Lee".match(/^John Doe$/) will not match, and "John Doe\nMary Lee".match(/^John Doe$/m) will match. That's all. Don't overcomplicate it. It just changes how ^ and $ match.
  6. So are "singleline" and "multiline" mutually exclusive? No, they are not. For example, if I want to match a, then any characters including newlines, and then f, but a must be at the beginning of a line and f must be at the end of a line, even somewhere inside 2000 lines of text, then "a b c \n d e f\nha".match(/^a.*f$/ms) is what needs to be used. It needs both: . matching \n, and ^ and $ matching the beginning and end of a line.
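The examples from the list above can be run as a quick check. This is only a sketch: the variable names are invented for illustration, and it assumes an engine with s flag support (for example a recent NodeJS or Chrome).

```javascript
// By default the dot skips the four line terminators; with s it accepts them.
const terminators = ["\n", "\r", "\u2028", "\u2029"];
const dotDefault = terminators.map((ch) => /./.test(ch));
const dotAll = terminators.map((ch) => /./s.test(ch));

const names = "John Doe\nMary Lee";
// Without m, ^ and $ anchor to the whole string, so this fails.
const withoutM = names.match(/^John Doe$/);
// With m, $ also matches just before the "\n".
const withM = names.match(/^John Doe$/m);

const lines = "a b c \n d e f\nha";
// m and s together: ^ and $ work per line, and . crosses the newline.
const both = lines.match(/^a.*f$/ms);
// With m alone the dot stops at the newline, so no line starts with a and ends with f.
const onlyM = lines.match(/^a.*f$/m);

console.log(dotDefault, dotAll, withoutM, withM && withM[0], both && both[0], onlyM);
```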

That's it. The above was tested on NodeJS and Chrome, which already support the s flag (the m flag has long been supported). And remember, you can always work around a missing s flag by using [^].
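A small sketch of those workarounds, using a made-up two-line sample string. Note that [^] is a JavaScript-specific construct and is an error or means something else in most other regex flavors.

```javascript
// Hypothetical sample: two lines joined by a single "\n".
const sample = "line 1\nline 2";

// Each pattern replaces the dot with something that also accepts "\n".
const workarounds = [
  /line 1[^]line 2/,      // empty negated class: any character at all (JS only)
  /line 1[\s\S]line 2/,   // whitespace or non-whitespace
  /line 1[\d\D]line 2/,   // digit or non-digit
  /line 1(.|\s)line 2/,   // dot, or whitespace (which includes newlines)
];
const workaroundResults = workarounds.map((re) => re.test(sample));

// For comparison, a plain dot without the s flag does not cross the newline.
const plainDot = /line 1.line 2/.test(sample);

console.log(workaroundResults, plainDot);
```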

Now, why were ms or ism used so much in the past? Because a lot of the time, when we have a really long string (e.g. 2000 lines of HTML), such as some web content we get back, we rarely want ^ to match only the beginning of the entire string and $ only the end of the entire string. That's why we use the m flag. We also probably want to match across lines, because (although using regex to match HTML is not recommended) we may use /<h1>.*?</h1>/ for a non-greedy match of a header, for example. We don't mind a \n in the content, because the author of the HTML may well have put a \n there (or not). That's why we use the "dotall" flag s.

But if you are trying to extract some info from a webpage, you probably won't care whether something is at the beginning or end of a line (because HTML files can contain extra spaces or indentation that usually don't affect the page content, unless there is a <pre> etc.), so you won't need ^ or $, and therefore you can forget about the m flag. And if you don't mind using [^]*? instead of .*?, then you can forget about the s flag too -- end of story.
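To make the header example concrete, here is a hedged sketch with an invented HTML snippet. The capture group and the trim() call are additions for illustration, and the closing slash is escaped because it sits inside a regex literal.

```javascript
// Hypothetical HTML fragment where the <h1> content spans several lines.
const html = "<div>\n  <h1>\n    Hello\n  </h1>\n</div>";

// No anchors are involved, so the m flag is irrelevant here.
// Plain dot: fails, because .*? cannot cross the newlines inside the header.
const plainMatch = html.match(/<h1>(.*?)<\/h1>/);

// Option 1: turn on s so the dot matches newlines.
const withS = html.match(/<h1>(.*?)<\/h1>/s);

// Option 2: no flags at all, using [^]*? in place of .*?
const withClass = html.match(/<h1>([^]*?)<\/h1>/);

const titleFromS = withS ? withS[1].trim() : null;
const titleFromClass = withClass ? withClass[1].trim() : null;

console.log(plainMatch, titleFromS, titleFromClass);
```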

Perl Cookbook said it in two sentences:

The difference between /m and /s is important: /m makes ^ and $ match next to a newline, while /s makes . match newlines. You can even use them together - they're not mutually exclusive options.


Maybe this way, I will never forget:

When I want to match across lines (usually using .*? to match something where it doesn't matter if it spans multiple lines), I will naturally think of multiline, and therefore 'm'. Well, 'm' is actually not the one, so it is 's'.

(Since I already remember 'ism' so well, I can always remember that if it is not 'm', then it must be 's'.)

Other lame attempts include:

s is for DOTALL, it is for DOT to match ALL.
m is multiline -- it is for ^ and $ to match a lot of times.
