Why is Python's any(_ in _ for _ in _) much faster than a for loop?

Posted 2025-02-05 18:06:15

This question is very similar to this post, but I couldn't find the answer there.

# 0m2.676s
#
# note: this should use == instead of in,
# but for timing purposes == would only make it faster
# (thanks to @Booboo for pointing this out)
#
# here: ------------++
#                   ||
if any("xmuijdswly" in w for w in data):
    print("FOUND IT")

is much faster than:

# 0m13.476s
for d in data:
    if "xmuijdswly" == d:
        print("FOUND IT")
        break

My data contains 10^7 arbitrary strings with an average length of 30.

EDIT: the time difference exists on both Linux and Windows:

PS ... > Measure-Command { python test_any.py }
TotalSeconds      : 4.5402383

PS ...> Measure-Command { python test_for.py }
TotalSeconds      : 17.7107506

EDIT: the program that generates the random strings (for completeness):

$ cat main.c
#include <stdio.h>
#include <stdlib.h>

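int main(int argc, char **argv)
{
    /* Plain "w" is the portable mode string; the 't' in "w+t" is a non-standard flag. */
    FILE *fl = fopen("input.txt", "w");
    if (fl == NULL)
    {
        perror("input.txt");
        return EXIT_FAILURE;
    }
    /* 10^7 lines of random lowercase letters, each 5 to 54 characters long. */
    for (int i = 0; i < 10000000; i++)
    {
        int length = 5 + rand() % 50;
        for (int j = 0; j < length; j++)
            fputc(rand() % 26 + 'a', fl);
        fputc('\n', fl);
    }
    fclose(fl);
    return 0;
}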


独闯女儿国 2025-02-12 18:06:16

First, your two benchmarks are performing different tests. The one with the for loop matches a string against a list of strings by testing for equality. The one with any uses the in operator, which tests whether one string is a substring of another and is a more expensive operation than testing for equality. But even if we change the any benchmark to test for equality instead, which should make it run even faster, I find the results to be the opposite of what you found, i.e. using any runs more slowly than the for loop. In the following demo I have arranged for both benchmarks to iterate through all 10_000_000 strings, because a match is never found:

import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def test2():
    return any('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w for w in data)

t = time.time()
b = test1()
print(b, time.time() - t)

t = time.time()
b = test2()
print(b, time.time() - t)

Prints:

False 0.2701404094696045
False 0.5590765476226807

The second version, using any, takes about twice as long (perhaps you should re-check your results). Why does the version using any actually run more slowly?

The argument to any is a generator expression, which creates a generator object backed by an implicit generator function. So to test each value in data, any has to resume that generator to get the next comparison result, and that per-item call overhead is what makes things run more slowly. We can turn your any version into one that uses an explicit generator function:

import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def generator():
    for w in data:
        yield 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w

def test2():
    return any(generator())

t = time.time()
b = test1()
print(b, time.time() - t)

t = time.time()
b = test2()
print(b, time.time() - t)

Prints:

False 0.28400230407714844
False 0.5430142879486084

As you can see, the time required for both any versions is essentially the same, as I would expect.
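
To make the per-item cost concrete, here is a small sketch (the names comparisons and resumed are mine, not from the question) that records every time any resumes the generator. It also shows that any stops pulling values as soon as it sees a true result:

```python
def comparisons(target, items, log):
    """Yield one equality result per item, recording each resumption."""
    for item in items:
        log.append(item)      # runs each time any() asks for the next value
        yield target == item

resumed = []
found = any(comparisons("b", ["a", "b", "c"], resumed))
print(found, resumed)         # the generator is never resumed for "c"
```

Each entry in resumed corresponds to one round trip between any and the generator; with 10_000_000 strings and no match, that round trip happens 10_000_000 times.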

Update: Using the OP's C Program to Generate the Data

Here I used the OP's C program to generate the data and I have used the same "xmuijdswly" string for the comparison:

import time

total_length = 0
data = []
with open('input.txt') as f:
    for line in f:
        line = line.strip()
        total_length += len(line)
        data.append(line)
print('Average string length:', total_length / len(data))

def test1():
    for d in data:
        if 'xmuijdswly' == d:
            return True
    return False

def test2():
    return any('xmuijdswly' == w for w in data)

t = time.time()
b = test1()
print(b,  'Time using for loop:', time.time() - t)

t = time.time()
b = test2()
print(b, 'Time using any:', time.time() - t)

Prints:

Average string length: 29.4972984
False Time using for loop: 0.3110032081604004
False Time using any: 0.6610157489776611
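
As an aside, for a pure equality search there is a third option that neither benchmark uses: list membership with the in operator, which walks the list and compares with == inside the interpreter. The following is only a sketch under my own assumptions (a small stand-in data list and the hypothetical function names with_loop, with_any and with_in); I would expect with_in to be the quickest of the three, but the exact ratios will depend on the Python version and machine:

```python
import timeit

# Small stand-in data set; the size and contents are illustrative only.
data = ["-" * 30 for _ in range(100_000)]
target = "xmuijdswly"

def with_loop():
    for d in data:
        if target == d:
            return True
    return False

def with_any():
    return any(target == w for w in data)

def with_in():
    # List membership compares each element with == without a Python-level loop.
    return target in data

for fn in (with_loop, with_any, with_in):
    best = min(timeit.repeat(fn, number=10, repeat=3))
    print(fn.__name__, fn(), best)
```

All three return the same answer; only the amount of work done in Python bytecode per element differs.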

One Possible Explanation for OP's Results

The OP hasn't actually posted the complete programs that they were testing against, which presumably have to first read in the data from input.txt. If the any version was run second, the data in input.txt would have been cached by the operating system as a result of the looping version having read the file and therefore the I/O time to read the input data would be much less for the any version.

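Was the time to do that initial creation of the data list erroneously included in the benchmark timings? The figures in the question appear to come from time and Measure-Command, which measure the whole process, including interpreter start-up and reading input.txt.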
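
One way to check is to time the load and the search separately. This is a minimal sketch, not the OP's program: the load and search helpers are hypothetical, and a tiny temporary file stands in for input.txt so that the example is self-contained:

```python
import os
import tempfile
import time

def load(path):
    """Read the file into a list of stripped lines."""
    with open(path) as f:
        return [line.strip() for line in f]

def search(data, target):
    """Return True if any element equals target."""
    for d in data:
        if target == d:
            return True
    return False

# Tiny stand-in for input.txt (assumption: one string per line).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("abc\nxmuijdswly\ndef\n")
    path = tmp.name

t0 = time.perf_counter()
data = load(path)
t1 = time.perf_counter()
found = search(data, "xmuijdswly")
t2 = time.perf_counter()
os.remove(path)

print(f"load: {t1 - t0:.6f}s  search: {t2 - t1:.6f}s  found: {found}")
```

If the load time dominates on the real 10^7-line file, then differences between whole-process timings say more about I/O and caching than about any versus the for loop.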
