How is str.find so fast?

Posted on 2025-01-25 09:52:22

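I had an earlier problem where I was looking for a substring while iterating over the string and using slicing. That turned out to be a really bad idea for performance. str.find is much faster, but I don't understand why.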

import random
import string
import timeit

# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))

def f():
    return [i for i in range(len(haystack)) if haystack[i : i + len(needle)] == needle]

def g():
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

def h():
    def find(start=0):
        while True:
            position = haystack.find(needle, start)
            if position < 0:
                return
            start = position + 1
            yield position
    return list(find())

number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
    assert eval(func + "()") == expectation
    t = timeit.timeit(func + "()", globals=globals(), number=number)
    print(func, t)

Results:

f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564


独享拥抱 2025-02-01 09:52:23

Built-in Python functions are implemented in C, which lets them run much faster. A function written in pure Python cannot match that performance.
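To make the gap concrete, here is a minimal sketch that compares a hand-written search loop with the built-in. The names `py_find` and `time_both` are hypothetical helpers introduced only for this illustration, and the timings depend on the machine and Python version.

```python
import timeit


def py_find(haystack, needle):
    """Pure-Python first-match search; mirrors what str.find returns."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        j = 0
        # Compare character by character in interpreted bytecode.
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i
    return -1


def time_both(haystack, needle, number=10):
    """Return (pure-Python seconds, built-in seconds) for the same search."""
    t_py = timeit.timeit(lambda: py_find(haystack, needle), number=number)
    t_c = timeit.timeit(lambda: haystack.find(needle), number=number)
    return t_py, t_c
```

Calling `time_both("a" * 100_000 + "b", "ab")` should show the built-in finishing far sooner, although the exact ratio will vary.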


世俗缘 2025-02-01 09:52:22

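f and g are slow because they check whether the needle can be found at every possible position of the haystack, which results in O(n m) complexity. f is slower still because the slicing operation creates a new string object each time (as Barmar pointed out in the comments).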
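The per-position cost can be made visible with a small sketch that counts how many indices are tested. `naive_positions` is a hypothetical helper written for this illustration; it follows the approach of g from the question.

```python
def naive_positions(haystack, needle):
    """Test every index, as f and g do; return (matches, number of checks)."""
    checks = 0
    matches = []
    for i in range(len(haystack)):
        checks += 1
        # startswith with an offset compares in place. A slice such as
        # haystack[i:i + len(needle)] would first build a new str object.
        if haystack.startswith(needle, i):
            matches.append(i)
    return matches, checks
```

The number of checks always equals `len(haystack)`, no matter how few matches exist, and every check is a separate call made from interpreted code.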

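h is fast because it can skip many positions. For example, if the needle is not found at all, only a single find call is performed. The built-in find is highly optimized C code and therefore faster than interpreted pure-Python code. Additionally, find uses an efficient search algorithm that is much faster than testing the needle at every possible position of the haystack when the string is relatively large. The related code is part of the CPython source.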
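As a rough illustration of how few Python-level steps h needs, the sketch below repeats its loop while counting the calls to find. `find_all_counting` is a hypothetical name used only here.

```python
def find_all_counting(haystack, needle):
    """Collect all match positions via str.find; return (positions, calls)."""
    positions = []
    calls = 0
    start = 0
    while True:
        calls += 1
        position = haystack.find(needle, start)
        if position < 0:
            return positions, calls
        positions.append(position)
        # Step one character forward so overlapping matches are found too.
        start = position + 1
```

The call count is the number of matches plus one, so a haystack with no match costs exactly one call, and all scanning between matches happens inside the C implementation.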

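If the number of occurrences is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the KMP algorithm, but doing that in pure Python would be very inefficient. You could write it in C or with Cython. That said, this is not trivial to do and not great to maintain.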
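For reference, a textbook KMP search that reports all matches (including overlapping ones) in a single pass might look like the sketch below. It is plain Python for readability only and is not the variant the answer has in mind for C or Cython. As noted above, it will likely run slower than the find-based loop. The names `build_failure_table` and `kmp_find_all` are hypothetical.

```python
def build_failure_table(needle):
    """table[i] = length of the longest proper prefix of needle[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = table[k - 1]
        if needle[i] == needle[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(haystack, needle):
    """Return the start index of every occurrence of needle in haystack."""
    if not needle:
        raise ValueError("needle must be non-empty")
    table = build_failure_table(needle)
    positions = []
    k = 0  # number of needle characters currently matched
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = table[k - 1]  # fall back without re-reading the haystack
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            positions.append(i - k + 1)
            k = table[k - 1]  # keep going so overlapping matches are found
    return positions
```

Each haystack character is read once, so the work is O(n + m) regardless of how many matches there are. That is the property that would pay off in a compiled implementation.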