re.match is anchored at the beginning of a string, while re.search scans through the entire string. So in the following example, x and y match the same thing.
x = re.match(r'pat', s) # <--- already anchored at the beginning of string
y = re.search(r'\Apat', s) # <--- match at the beginning (raw string keeps \A intact)
If a string doesn't contain line breaks, \A and ^ are essentially the same; the difference shows up in multiline strings. In the following example, re.match will never match the second line, while re.search can with the correct regex (and flag).
s = "1\n2"
re.match('2', s, re.M) # no match
re.search('^2', s, re.M) # match
re.search(r'\A2', s, re.M) # no match <--- mimics `re.match`
There's another function in re, re.fullmatch(), that must match the entire string, so it is anchored at both the beginning and the end of the string. So in the following example, x, y and z match the same thing.
x = re.match(r'pat\Z', s) # <--- already anchored at the beginning; must match end
y = re.search(r'\Apat\Z', s) # <--- match at the beginning and end of string
z = re.fullmatch(r'pat', s) # <--- already anchored at the beginning and end
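To make the equivalence above concrete, here is a minimal sketch that runs the three calls on a few strings. The helper name anchored_results and the sample strings are made up for illustration, and the helper assumes a pattern without top-level alternation (with something like 'a|b', appending \Z would only apply to the last alternative).

```python
import re

def anchored_results(pattern: str, text: str) -> tuple:
    """Report whether each of the three fully anchored calls finds a match."""
    x = re.match(pattern + r'\Z', text)           # start anchored by match, end by \Z
    y = re.search(r'\A' + pattern + r'\Z', text)  # both ends anchored explicitly
    z = re.fullmatch(pattern, text)               # both ends anchored by fullmatch
    return (x is not None, y is not None, z is not None)

exact = anchored_results('pat', 'pat')         # the whole string is the pattern
trailing = anchored_results('pat', 'pattern')  # extra text after the pattern
leading = anchored_results('pat', 'a pat')     # extra text before the pattern

print(exact)     # (True, True, True)
print(trailing)  # (False, False, False)
print(leading)   # (False, False, False)
```

All three calls agree on every input, which is the point of the x, y, z example.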
Based on Jeyekomon's answer (and using their setup), I used the perfplot library to plot the results of timeit tests that look into:
How do they compare if re.search "mimics" re.match? (first plot)
How do they compare if re.match "mimics" re.search? (second plot)
How do they compare if the same pattern is passed to them? (last plot)
Note that the last pattern doesn't produce the same output (because re.match is anchored at the beginning of a string).
The first plot shows match is faster if search is used like match. The second plot supports @Jeyekomon's answer and shows search is faster if match is used like search. The last plot shows there's very little difference between the two if they scan for the same pattern.
Code used to produce the performance plot.
import re
from random import choices
from string import ascii_lowercase

import matplotlib.pyplot as plt
import perfplot  # the code below calls perfplot.plot, so import the module itself

# Each pair is (pattern given to search, pattern given to match).
patterns = [
    [re.compile(r'\Aword'), re.compile(r'word')],     # search mimics match
    [re.compile(r'word'), re.compile(r'(.*?)word')],  # match mimics search
    [re.compile(r'word')] * 2,                        # same pattern for both
]

fig, axs = plt.subplots(1, 3, figsize=(20, 6), facecolor='white')
for i, (pat1, pat2) in enumerate(patterns):
    plt.sca(axs[i])  # draw this benchmark into the i-th subplot
    perfplot.plot(
        setup=lambda n: [''.join(choices(ascii_lowercase, k=10)) for _ in range(n)],
        kernels=[lambda lst: [*map(pat1.search, lst)], lambda lst: [*map(pat2.match, lst)]],
        labels=[f"re.search(r'{pat1.pattern}', w)", f"re.match(r'{pat2.pattern}', w)"],
        n_range=[2**k for k in range(24)],
        xlabel='Length of list',
        equality_check=None,
    )
fig.suptitle('re.match vs re.search')
fig.tight_layout()
Quick answer
re.search('test', ' test') # returns a truthy match object (the search can start at any index)
re.match('test', ' test') # returns None (the match is only attempted at index 0)
re.match('test', 'test') # returns a truthy match object (match at index 0)
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
The difference is, re.match() misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search() does not. :-)
More soberly, as John D. Cook remarks, re.match() "behaves as if every pattern has ^ prepended." In other words, re.match('pattern') is equivalent to re.search('^pattern') (as long as re.MULTILINE is not set; with that flag the exact equivalent is \A rather than ^). So it anchors a pattern's left side. But it does not anchor a pattern's right side: that still requires a terminating $.
Frankly, given the above, I think re.match() should be deprecated. I would be interested to know reasons it should be retained.
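The two claims above (left side anchored, right side not) can be checked with a short sketch. The sample string and variable names are assumptions chosen for illustration; the calls themselves are standard re functions.

```python
import re

text = "first\nsecond"

# Without flags, '^' only matches at the start of the string, so both agree.
plain_match = re.match('second', text)
plain_search = re.search('^second', text)

# With re.MULTILINE, '^' also matches right after a newline; re.match is unaffected.
multiline_match = re.match('second', text, re.MULTILINE)
multiline_search = re.search('^second', text, re.MULTILINE)

# re.match does not anchor the right side: text after the match is allowed
# unless the pattern itself ends with '$'.
prefix_only = re.match('first', text)
whole_string = re.match('first$', text)

print(plain_match, plain_search)                  # None None
print(multiline_match, multiline_search.group())  # None second
print(prefix_only.group(), whole_string)          # first None
```

So the "^ prepended" description holds only while re.MULTILINE is off, and a trailing $ is still needed to pin the end of the match.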
re.search searches for the pattern throughout the string, whereas re.match does not search at all: it can only match the pattern at the start of the string, and fails if the pattern is not there.
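A comment by @ivan_bilan under the accepted answer claims: "match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples." The test suite below checks how much performance is actually gained: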
import random
import re
import string
import time

LENGTH = 10          # characters per random word
LIST_SIZE = 1000000  # number of words to test

def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word

wordlist = [generate_word() for _ in range(LIST_SIZE)]

# Time re.search with the plain pattern.
start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)

# Time re.match with the pattern wrapped so that it can match anywhere.
start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
As you can see, searching for the pattern 'python' is faster than matching the pattern '(.*?)python(.*?)'.
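For anyone who wants to repeat the comparison quickly, here is a smaller sketch using the standard timeit module instead of time.time(). The list size, seed, repeat counts, and function names are arbitrary assumptions; absolute timings will differ by machine and Python version, so no particular numbers are claimed.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so both kernels see the same words on every run
words = [''.join(random.choices(string.ascii_lowercase, k=10)) for _ in range(10_000)]

search_pat = re.compile('python')
match_pat = re.compile('(.*?)python(.*?)')

def run_search():
    return [search_pat.search(w) for w in words]

def run_match():
    return [match_pat.match(w) for w in words]

# Take the best of several repeats to reduce noise from other processes.
search_time = min(timeit.repeat(run_search, number=5, repeat=3))
match_time = min(timeit.repeat(run_match, number=5, repeat=3))
print(f"search: {search_time:.4f}s  match: {match_time:.4f}s")
```

Both kernels flag exactly the same words as hits, so the comparison is like for like; only the time taken differs.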
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So if you need to match at the beginning of the string, or to match the entire string, use match (for the entire string the pattern must also end with $ or \Z, or you can use re.fullmatch). It is faster. Otherwise use search.
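The documentation's remark that None "is different from a zero-length match" is easy to overlook, so here is a small sketch of it. The patterns 'x*' and 'x+' and the sample string are assumptions picked for illustration.

```python
import re

zero_length = re.match('x*', 'abc')  # 'x*' matches the empty string at index 0
no_match = re.match('x+', 'abc')     # 'x+' needs at least one 'x' at index 0

print(repr(zero_length.group()), zero_length.span())  # '' (0, 0)
print(no_match)                                       # None

# A zero-length match is still a match object, so it is truthy in an if test.
if zero_length:
    print("matched, even though nothing was consumed")
```

This matters when the result is used as a condition: a pattern that can match the empty string will always report success with match or search.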
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

# The pattern literal was cut off in the source; 'thing$' is the reading that
# makes the MULTILINE flag and the results below meaningful.
m = re.compile('thing$', re.MULTILINE)

print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flag already belongs to the compiled pattern; the second positional
# argument of m.search() is pos, so re.MULTILINE must not be passed here.
print(m.search(string_with_newlines))  # also matches
Much shorter: search scans through the whole string; match scans only the beginning of the string.
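The example that originally accompanied this short answer did not survive, so the following is only a stand-in sketch with a made-up input string. It shows where each function looks by printing the span of the match.

```python
import re

text = "say hello"  # hypothetical input; 'hello' starts at index 4

searched = re.search('hello', text)  # scans forward until the pattern fits
matched = re.match('hello', text)    # only tries index 0

print(searched.span())  # (4, 9)
print(matched)          # None
```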
An example helps to show how re.match and re.search work. When abc occurs somewhere other than the start of the string, re.match will return None, but re.search will return a match for abc.
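The original example for this answer is missing from the page, so here is a hedged reconstruction of the behaviour it describes. The input string "123abc" is an assumption; any string containing abc away from index 0 would behave the same way.

```python
import re

text = "123abc"  # hypothetical input: 'abc' is present, but not at the start

match_result = re.match('abc', text)    # tried only at index 0, where '1' is
search_result = re.search('abc', text)  # finds 'abc' further along the string

print(match_result)           # None
print(search_result.group())  # abc
```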
That comment from @ivan_bilan under the accepted answer got me wondering whether such a hack actually speeds anything up, hence the test suite and measurements above. Python is smart. Avoid trying to be smarter.
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern. re.search searches the entire string. The documentation has a specific section on match vs. search that also covers multiline strings.