How is str.find so fast?

Posted on 2025-01-25 09:52:22

I had an earlier problem where I was looking for a substring while iterating the string and using slicing. Turns out that's a really bad idea regarding performance. str.find is much faster. But I don't understand why?

import random
import string
import timeit

# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))

def f():
    return [i for i in range(len(haystack)) if haystack[i : i + len(needle)] == needle]

def g():
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

def h():
    def find(start=0):
        while True:
            position = haystack.find(needle, start)
            if position < 0:
                return
            start = position + 1
            yield position
    return list(find())

number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
    assert eval(func + "()") == expectation
    t = timeit.timeit(func + "()", globals=globals(), number=number)
    print(func, t)

Results:

f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564

Comments (2)

独享拥抱 2025-02-01 09:52:23

The built-in Python functions are implemented in C, which allows them to be much faster. It's not possible to make a function that performs just as well when using Python.

世俗缘 2025-02-01 09:52:22

f and g are slow because they check whether needle can be found at every possible location of haystack, resulting in O(n m) complexity. f is the slower of the two because its slicing operation creates a new string object at every position (as pointed out by Barmar in the comments).
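As a small illustration of the point above, here is a sketch on a tiny made-up haystack (the sample string is an assumption for illustration, not the 1 MB data from the question). Both approaches run one Python-level check per position of haystack, and the slicing version additionally builds a temporary string each time.

```python
# Tiny sample data, chosen only for illustration.
haystack = "xxabcdxxabcd"
needle = "abcd"

# Approach of f: slice out a new string at every index, then compare.
by_slice = [
    i for i in range(len(haystack))
    if haystack[i : i + len(needle)] == needle
]

# Approach of g: compare in place at every index, without a temporary string.
by_startswith = [
    i for i in range(len(haystack))
    if haystack.startswith(needle, i)
]

# Either way, the interpreter performs one check per haystack position.
checks = len(haystack)

print(by_slice, by_startswith, checks)
```

Both lists contain the same positions; the difference between the two is only in how much work each per-position check costs.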

h is fast because it can skip many locations. For example, if the needle string is not found, only one find call is performed. The built-in find function is highly optimized in C and thus faster than interpreted pure-Python code. Additionally, the find function uses an efficient algorithm called Crochemore and Perrin's Two-Way. This algorithm is much faster than searching for needle at every possible location of haystack when the string is relatively big. The related CPython code is available here.
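To make the "skip many locations" point concrete, here is a minimal sketch that counts how many times str.find is called. The helper name find_all and the sample haystack are assumptions made for this illustration; the loop itself mirrors h from the question.

```python
def find_all(haystack, needle):
    """Return (positions, calls): all match positions and the number of
    str.find calls that were needed to collect them."""
    positions = []
    calls = 0
    start = 0
    while True:
        calls += 1
        position = haystack.find(needle, start)
        if position < 0:
            return positions, calls
        positions.append(position)
        # Resume right after the start of the last match so that
        # overlapping matches are still reported.
        start = position + 1


# Sample data: two matches separated by 1000 non-matching characters.
haystack = "abcd" + "x" * 1000 + "abcd"
positions, calls = find_all(haystack, "abcd")
print(positions, calls)  # [0, 1004] 3
```

The Python-level loop runs once per occurrence plus one final failing call, so the 1000 characters in between are scanned entirely inside the C implementation.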

If the number of occurrences is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the CPTW algorithm or possibly the KMP algorithm, but doing that in pure Python will be very inefficient. You could do that in C or with Cython. That being said, this is not trivial to do and not great to maintain.
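For reference, a pure-Python KMP search that reports all (possibly overlapping) occurrences could look like the sketch below. The function name kmp_find_all and the choice to reject an empty needle are assumptions of this example. As noted above, it is expected to be much slower than str.find in practice, so it only shows the shape of the algorithm that a C or Cython version would follow.

```python
def kmp_find_all(haystack, needle):
    """Return the start index of every occurrence of needle in haystack
    using the Knuth-Morris-Pratt algorithm (illustrative sketch)."""
    if not needle:
        # Design choice for this sketch: an empty needle is rejected.
        raise ValueError("needle must not be empty")

    # fail[i] is the length of the longest proper prefix of needle[:i + 1]
    # that is also a suffix of it.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k

    positions = []
    k = 0  # number of needle characters currently matched
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = fail[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            positions.append(i - k + 1)
            # Fall back so that overlapping matches are also found.
            k = fail[k - 1]
    return positions


print(kmp_find_all("abababa", "aba"))  # [0, 2, 4]
```

Each haystack character is visited once, which gives the linear worst case, but every step here is an interpreted Python operation rather than a C loop.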
