str.find如何这么快?
我遇到了一个较早的问题,在迭代字符串并使用切片时,我正在寻找substring。事实证明,这是关于性能的不良主意。 str.find
要快得多。但是我不明白为什么?
import random
import string
import timeit
# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))
def f():
return [i for i in range(len(haystack)) if haystack[i : i + len(needle)] == needle]
def g():
return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]
def h():
def find(start=0):
while True:
position = haystack.find(needle, start)
if position < 0:
return
start = position + 1
yield position
return list(find())
number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
assert eval(func + "()") == expectation
t = timeit.timeit(func + "()", globals=globals(), number=number)
print(func, t)
结果:
f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
内置的Python函数在C中实现,这使它们可以更快。使用Python时,不可能做出同样执行的函数。
The built-in Python functions are implemented in C, which allows them to be much faster. It's not possible to make a function that performs just as well when using Python.
f
和g
速度很慢,因为它们是否可以在o(nm)
复杂性。f
由于切片操作而创建一个新的字符串对象(如Barmar在注释中指出的那样)。H
很快,因为它可以跳过许多位置。例如,如果找不到针头
字符串,则仅执行一个查找
。内置查找
函数在C中高度优化,因此比解释的纯Python代码更快。此外,查找
函数使用一种高效算法,称为。该算法比搜索针头
在haystack
相对较大时的每个可能位置要快得多。 The related CPython code is available here.如果发生的数量相对较少,则您的实现应该已经很好。否则,最好使用基于 kmp算法,但在纯python中这样做将非常低效。您可以在C或Cython中这样做。话虽这么说,这并不是很重要,而且维护并不是很好。
f
andg
are slow since they check ifneedle
can be found in every possible location ofhaystack
resulting in aO(n m)
complexity.f
is slower because of the slicing operation that creates a new string object (as pointed out by Barmar in the comments).h
is fast because it can skip many locations. For example, if theneedle
string is not found, only onefind
is performed. The built-infind
function is highly optimized in C and thus faster than an interpreted pure-Python code. Additionally, thefind
function use an efficient algorithm called Crochemore and Perrin's Two-Way. This algorithm is much faster than searchingneedle
at every possible location ofhaystack
when the string is relatively big. The related CPython code is available here.If the number of occurrence is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the CPTW algorithm of possibly the KMP algorithm but doing that in pure-Python will be very inefficient. You could do that in C or with Cython. That being said this is not trivial to do and not great to maintain.