Explanation of a regular expression in Sphinx Gallery
I am debugging Sphinx Gallery tooltip generation, which involves the following code:
def extract_intro_and_title(filename, docstring):
    """Extract and clean the first paragraph of module-level docstring."""
    # lstrip is just in case docstring has a '\n\n' at the beginning
    paragraphs = docstring.lstrip().split('\n\n')
    # remove comments and other syntax like `.. _link:`
    paragraphs = [p for p in paragraphs
                  if not p.startswith('.. ') and len(p) > 0]
    if len(paragraphs) == 0:
        raise ExtensionError(
            "Example docstring should have a header for the example title. "
            "Please check the example file:\n {}\n".format(filename))
    # Title is the first paragraph with any ReSTructuredText title chars
    # removed, i.e. lines that consist of (3 or more of the same) 7-bit
    # non-ASCII chars.
    # This conditional is not perfect but should hopefully be good enough.
    title_paragraph = paragraphs[0]
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)
    if match is None:
        raise ExtensionError(
            'Could not find a title in first paragraph:\n{}'.format(
                title_paragraph))
    title = match.group(0).strip()
    # Use the title if no other paragraphs are provided
    intro_paragraph = title if len(paragraphs) < 2 else paragraphs[1]
    # Concatenate all lines of the first paragraph and truncate at 95 chars
    intro = re.sub('\n', ' ', intro_paragraph)
    intro = _sanitize_rst(intro)
    if len(intro) > 95:
        intro = intro[:95] + '...'
    return intro, title
The line which I do not understand is:
match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                  re.MULTILINE)
Can someone explain it to me please?
To start, the signature is re.search(pattern, string, flags=0). That tells us that re.search takes a pattern, a string, and optional flags that default to 0. That probably doesn't help much on its own.
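As a quick illustration of that signature, here is a minimal sketch. The sample text is made up for illustration and is not taken from Sphinx Gallery. It shows that re.search returns a match object for the first place the pattern matches and None when nothing matches, which is why the code in the question checks for match is None.

```python
import re

# Hypothetical sample text, not taken from Sphinx Gallery.
text = "first line\nsecond line"

# flags is omitted here, so it defaults to 0 (no special behavior).
found = re.search(r"second", text)
missing = re.search(r"third", text)

print(found.group(0))  # second
print(missing)         # None
```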
The flag being passed is re.MULTILINE. That tells the regular expression engine to treat ^ and $ as the beginning and end of each line. By default, they apply to the beginning and end of the whole string, regardless of how many lines make up the string.

The pattern that's being matched is looking for the following:

^ - the match must start at the beginning of a line.

(?!([\W _])\1{3,}) - the line can't begin with four or more copies of the same character, where that character is a non-word character (\W), a space or an underscore (_). This is a negative lookahead ((?!...)) wrapped around a character class in parentheses (([\W _])), which makes it capture group 1. \1 refers to whatever capture group 1 matched, and {3,} means at least 3 times, so the captured character has to be followed by 3 or more repeats of itself. The captured character plus its 3 repeats is what enforces that the first 4 characters can't be the same repeated non-word character or underscore. The lookahead doesn't consume any characters; it only succeeds at a position where the condition holds. As a side note, \W matches the opposite of \w. For ASCII text, or with the re.ASCII flag, \w is shorthand for [A-Za-z0-9_] and \W is shorthand for [^A-Za-z0-9_]; for Python 3 str patterns, \w also matches Unicode letters and digits by default. Since a space is already a non-word character, only the _ in the character class adds anything to \W.

(.+) - if the lookahead succeeded and the line contains 1 or more characters, the entire line is matched into capture group 2. Because ^ and the lookahead consume nothing, match.group(0) in the code from the question returns the same text.

You can use https://regex101.com/r/3p73lf/1 to explore the behavior of the regular expression.
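The behavior described above can be checked with a short sketch. The pattern is the one from the question; the title paragraphs and the TITLE_RE name are hypothetical examples chosen for illustration, not data from Sphinx Gallery.

```python
import re

# The pattern from the question (TITLE_RE is a name chosen for this sketch).
TITLE_RE = r'^(?!([\W _])\1{3,})(.+)'

# Hypothetical first paragraph of an example docstring:
# an RST overline, the title text, and an underline.
title_paragraph = "=========\nMy Title\n========="

# With re.MULTILINE, ^ is tried at the start of every line, so the
# overline is skipped and the title line is found.
multi = re.search(TITLE_RE, title_paragraph, re.MULTILINE)
print(multi.group(0))  # My Title

# Without the flag, ^ only matches at the very start of the string,
# where the lookahead rejects the run of '=' characters.
single = re.search(TITLE_RE, title_paragraph)
print(single)  # None

# Three identical punctuation characters are not enough to be rejected:
# the lookahead needs one captured character plus at least 3 repeats.
short = re.search(TITLE_RE, "===\nTitle", re.MULTILINE)
print(short.group(0))  # ===
```

The last case shows why the comment in the original code calls the check "not perfect": a three-character underline would itself be picked up as the title.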