Translating the intent of a PHP regex for multiline strings into Python/Perl

Posted on 2024-11-28 22:09:13

Below is a PHP regex intended to match (multiline) strings inside PHP or JavaScript source code (from this post), but I suspect it's got issues.
What is the literal Python (or else Perl) equivalent of this?

~'(\\.|[^'])*'|"(\\.|[^"])*"~s
  • the s modifier means dot matches all characters, including newline; in Python that's re.compile(..., re.DOTALL)
  • I don't understand the intent of the leading \\. at all. Does it reduce to . ? Do backslashes need to be escaped twice in PHP?
  • Allowing either \\. or [^'] (any non-quote character) at every position seems like overkill to me, and may explain why this person's regex blows up. Doesn't [^'] already match everything that . matches with the s modifier, newlines included?

  • To construct the single-quoted and double-quoted versions of the regex in Python, one can use this two-step approach.

  • NB: a simpler version of this regex can also be found in this list of PHP regex examples (http://www.roscripts.com/PHP_regular_expressions_examples-136.html), under Programming: String.
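
The points in the list above can be checked directly in Python. What follows is a minimal sketch, not code from the original post: the names SINGLE, DOUBLE, STRING_RE and find_strings are made up for illustration, and building the pattern from two raw strings is just one way to sidestep the quoting problem.

```python
import re

# Literal port of the PHP pattern: the ~...~ delimiters are dropped and the
# trailing s modifier becomes re.DOTALL.
SINGLE = r"'(\\.|[^'])*'"   # single-quoted string
DOUBLE = r'"(\\.|[^"])*"'   # double-quoted string
STRING_RE = re.compile(SINGLE + "|" + DOUBLE, re.DOTALL)


def find_strings(source):
    """Return every quoted string matched in source, in order of appearance."""
    return [m.group(0) for m in STRING_RE.finditer(source)]


# A negated class such as [^'] matches a newline with or without DOTALL...
assert re.fullmatch(r"[^']", "\n") is not None
# ...whereas the dot needs DOTALL to do so.
assert re.fullmatch(r".", "\n") is None
assert re.fullmatch(r".", "\n", re.DOTALL) is not None

print(find_strings("x = 'line one\nline two'; y = \"z\""))
```

In this pattern the dot only appears in \\. so re.DOTALL only matters for a backslash that is immediately followed by a newline.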

Comments (2)

滥情哥ㄟ 2024-12-05 22:09:13

The \\. is meant to match a literal backslash and swallow the character that follows it. Note that since patterns in PHP (and Python) are contained in strings, it would actually need to be written \\\\. in an ordinary (non-raw) string literal, so that it ends up as \\. in the regex.
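
The two layers of escaping can be seen in a short Python sketch. The names raw_form, escaped_form and is_escape_pair are illustrative only.

```python
import re

raw_form = r"\\."        # raw string: the backslashes are kept exactly as typed
escaped_form = "\\\\."   # ordinary string: each \\ collapses to one backslash

# Both literals produce the same three characters: backslash, backslash, dot.
assert raw_form == escaped_form
assert len(raw_form) == 3


def is_escape_pair(text):
    """True if text is exactly a backslash followed by one other character."""
    return re.fullmatch(raw_form, text, re.DOTALL) is not None


print(is_escape_pair("\\'"))   # text is a backslash followed by a quote
print(is_escape_pair("ab"))    # two ordinary characters
```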

It's important to match the backslash and swallow the following character because it could be used to escape a quote which would otherwise end the match prematurely.

This pattern looks like it should work fine, and I can't think of a more succinct way to express it.

It should also work fine in Python (as you say, with re.DOTALL). In Python you could use the raw string notation to save the extra escaping of the backslash although you'd still need to escape the single quote. This should be equivalent:

re.search(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', str, re.DOTALL)
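
As a rough usage sketch, assuming nothing beyond the standard re module, the pattern can be compiled once and run against a sample that contains an escaped quote. The helper first_string_literal and the sample text are hypothetical.

```python
import re

# The pattern from the line above, unchanged; only the names are illustrative.
STRING_RE = re.compile(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', re.DOTALL)


def first_string_literal(source):
    """Return the first quoted string found in source, or None."""
    match = STRING_RE.search(source)
    return match.group(0) if match else None


# The text being searched is:  msg = 'it\'s fine'; rest
sample = "msg = 'it\\'s fine'; rest"
print(first_string_literal(sample))
```

The variable is called source rather than str here so that the built-in str type is not shadowed.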

述情 2024-12-05 22:09:13

The regex is mostly okay, except it doesn't handle escaped quotes (i.e., \" and \') reliably: since [^'] also matches a backslash, backtracking can let an escaped quote end the match. That's easy enough to fix:

'(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"

That's a "generic" regex; in Python you would usually write it in the form of a raw string (the final quote is backslash-escaped so that it doesn't run into the closing triple quote):

r"""'(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*\""""
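
To see what the change buys, here is a small Python comparison of the question's pattern and the fixed one. It is a sketch under stated assumptions: ORIGINAL, FIXED, first_match and both sample texts are made up for illustration, and each pattern is assembled from two raw strings only to keep the quoting readable.

```python
import re

# The question's pattern: [^'] and [^"] can also match a backslash.
ORIGINAL = re.compile(r"'(\\.|[^'])*'" + "|" + r'"(\\.|[^"])*"', re.DOTALL)
# The fixed pattern: backslashes are excluded from the character classes.
FIXED = re.compile(r"'(?:\\.|[^'\\]+)*'" + "|" + r'"(?:\\.|[^"\\]+)*"', re.DOTALL)


def first_match(pattern, source):
    """Return the first substring matched by pattern, or None."""
    match = pattern.search(source)
    return match.group(0) if match else None


# Text:  a = 'abc\' + x     (the string is never closed)
unterminated = "a = 'abc\\' + x"
# Text:  a = 'abc\'' + x    (escaped quote, then the real closing quote)
terminated = "a = 'abc\\'' + x"

for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    print(name, first_match(pattern, unterminated), first_match(pattern, terminated))
```

On the unterminated text the original pattern backtracks, treats the backslash as an ordinary character, and stops at the escaped quote, while the fixed pattern reports no string at all. On the properly terminated text both agree.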

In PHP you have to escape the backslashes to get them past PHP's string processing:

'~\'(?:\\\\.|[^\'\\\\]+)*\'|"(?:\\\\.|[^"\\\\]+)*"~s'

Most of the currently-popular languages have either a string type that requires less escaping, support for regex literals, or both. Here's how your regex would look as a C# verbatim string:

@"'(?:\\.|[^'\\]+)*'|""(?:\\.|[^""\\]+)*"""

But, formatting considerations aside, the regex itself should work in any Perl-derived flavor (and many other flavors as well).


p.s.: Notice how I added the + quantifier to your character classes. Your intuition about matching one character at a time is correct; adding the + makes a huge difference in performance. But don't let that fool you; when you're dealing with regexes, intuition seems to be wrong more often than not. :/
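
One rough way to check the performance remark on your own engine is to time both variants. This is only a sketch: the sample text, the repeat count and the helper time_pattern are assumptions, and the numbers will vary with the regex engine, its version and the input, so no particular result is claimed here.

```python
import re
import timeit

# Same fixed pattern (single-quoted half only), with and without the inner +.
ONE_AT_A_TIME = re.compile(r"'(?:\\.|[^'\\])*'", re.DOTALL)
WITH_PLUS = re.compile(r"'(?:\\.|[^'\\]+)*'", re.DOTALL)

# Hypothetical input: one long, properly terminated single-quoted string
# containing many escaped quotes.
sample = "'" + "lorem ipsum \\' " * 500 + "'"


def time_pattern(pattern, text, repeat=20):
    """Return the total time in seconds for `repeat` searches of text."""
    return timeit.timeit(lambda: pattern.search(text), number=repeat)


if __name__ == "__main__":
    print("one at a time:", time_pattern(ONE_AT_A_TIME, sample))
    print("with +       :", time_pattern(WITH_PLUS, sample))
```

It is worth timing a failing case separately as well: in a backtracking engine, a + nested inside a * group may backtrack heavily on a long string that is never closed.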
