Regex explanation for Sphinx-Gallery

Posted on 2025-01-18 18:18:59

I am debugging Sphinx-Gallery tooltip generation, which involves the following code:

def extract_intro_and_title(filename, docstring):
    """Extract and clean the first paragraph of module-level docstring."""
    # lstrip is just in case docstring has a '\n\n' at the beginning
    paragraphs = docstring.lstrip().split('\n\n')
    # remove comments and other syntax like `.. _link:`
    paragraphs = [p for p in paragraphs
                  if not p.startswith('.. ') and len(p) > 0]
    if len(paragraphs) == 0:
        raise ExtensionError(
            "Example docstring should have a header for the example title. "
            "Please check the example file:\n {}\n".format(filename))
    # Title is the first paragraph with any ReSTructuredText title chars
    # removed, i.e. lines that consist of (3 or more of the same) 7-bit
    # non-ASCII chars.
    # This conditional is not perfect but should hopefully be good enough.
    title_paragraph = paragraphs[0]
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)

    if match is None:
        raise ExtensionError(
            'Could not find a title in first paragraph:\n{}'.format(
                title_paragraph))
    title = match.group(0).strip()
    # Use the title if no other paragraphs are provided
    intro_paragraph = title if len(paragraphs) < 2 else paragraphs[1]
    # Concatenate all lines of the first paragraph and truncate at 95 chars
    intro = re.sub('\n', ' ', intro_paragraph)
    intro = _sanitize_rst(intro)
    if len(intro) > 95:
        intro = intro[:95] + '...'
    return intro, title

The line which I do not understand is:

match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                  re.MULTILINE)

Can someone explain it to me please?

Comments (1)

手长情犹 2025-01-25 18:18:59

To start:

>>> import re
>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.

That tells us that re.search takes a pattern, a string, and optional flags that default to 0.

That probably doesn't help much on its own.

The flag being passed is re.MULTILINE. That tells the regular expression engine to treat ^ and $ as the beginning and end of each line. By default, they match only at the beginning and end of the whole string (with $ also matching just before a trailing newline), regardless of how many lines the string contains.
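
To make the effect of the flag concrete, here is a small sketch. The sample string is made up for illustration; it mimics a title with reStructuredText over- and underlines.

```python
import re

# Hypothetical sample: a title line between two marker lines.
text = "=====\nTitle\n====="

# Without re.MULTILINE, ^ only anchors at the very start of the string,
# where the next character is '=' and \w+ cannot match.
default_hits = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also anchors right after each newline,
# so the second line becomes reachable.
multiline_hits = re.findall(r'^\w+', text, re.MULTILINE)

print(default_hits)    # []
print(multiline_hits)  # ['Title']
```

This is why the function in the question can find the title even when it is not on the first line of the paragraph.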

The pattern that's being matched is looking for the following:

^ - the match must start at the beginning of a line

(?!([\W _])\1{3,}) - the line must not begin with the same character repeated four or more times, where that character is a non-word character (\W), a space ( ) or an underscore (_). This is a negative lookahead ((?! ... )) wrapped around a character class in parentheses (([\W _])), which makes it capture group 1. \1 refers back to whatever group 1 captured, and {3,} means at least 3 times, so the first character plus at least 3 repeats of it adds up to 4 or more identical characters, such as the ==== or ---- lines that underline a reStructuredText title. A lookahead doesn't consume any characters; the negative form succeeds at a position only when its inner pattern does not match there. Note that the code comment says "3 or more of the same", but as written the pattern only rejects runs of 4 or more.

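As a side note, \W matches the opposite of \w. With re.ASCII, \w is shorthand for [A-Za-z0-9_] and \W for [^A-Za-z0-9_]; for str patterns in Python 3, \w also matches Unicode letters and digits by default. Either way a space is already covered by \W, so the explicit space in the class is redundant, while the underscore has to be listed separately because it counts as a word character.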

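(.+) - if the lookahead succeeded and the line contains at least 1 character, the whole line is matched in capture group 2. Since . does not match a newline by default, the match stops at the end of the line. Because nothing else in the pattern consumes characters, match.group(0), which the code uses for the title, holds the same text as group 2.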
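
The lookahead is the least obvious piece, so here is a minimal sketch that isolates it. The helper name starts_like_underline and the test strings are made up for illustration; only the regex fragment comes from the question.

```python
import re

# The lookahead from the question, without the trailing (.+).
underline = re.compile(r'^(?!([\W _])\1{3,})')

def starts_like_underline(line):
    """Return True when the negative lookahead rejects the start of the line."""
    return underline.match(line) is None

results = {
    "====": starts_like_underline("===="),    # '=' plus 3 repeats
    "===": starts_like_underline("==="),      # only 2 repeats after the first '='
    "=-=-": starts_like_underline("=-=-"),    # characters differ
    "____x": starts_like_underline("____x"),  # underscore is in the class
    "Title": starts_like_underline("Title"),  # 'T' is a word character
}

for line, rejected in results.items():
    print(repr(line), rejected)
```

Only "====" and "____x" are rejected: both start with one character from the class followed by at least three copies of it.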

See https://regex101.com/r/3p73lf/1 to explore the behavior of the regular expression.
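
For a local check without the website, the full pattern can be run against a sample paragraph. The paragraph text below is an assumption made up for this sketch, and TITLE_RE is just a convenience name; the pattern and the group(0).strip() step are taken from the question.

```python
import re

# Pattern from the question, compiled once with the same flag.
TITLE_RE = re.compile(r'^(?!([\W _])\1{3,})(.+)', re.MULTILINE)

# Hypothetical first paragraph of an example docstring.
title_paragraph = "==========\nMy example\n=========="

match = TITLE_RE.search(title_paragraph)
title = match.group(0).strip()
same_text = match.group(0) == match.group(2)

# A paragraph made only of marker characters yields no match at all,
# which is the case where the function raises ExtensionError.
no_title = TITLE_RE.search("-----")

print(title)      # My example
print(same_text)  # True
print(no_title)   # None
```

The first line is skipped because it starts with a run of "=", so the search moves on to the next line start and returns the title line.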
