Why is creating a set from a filter much faster than creating a list or a tuple?
I'm running filter on an iterable and want to store the result in a sequence (I need a sequence so that I can use random.choice on it). I noticed that creating a set from a filter object is a lot faster than creating a list or a tuple. Why is that? I first thought that the filter type is a subtype of set, which would explain this, but the filter function is actually equivalent to a generator expression, so it cannot really be a set internally.
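As a side note, the random.choice requirement can be met by materializing the filter first; a minimal sketch (variable names are illustrative):

```python
import random

# random.choice needs a sequence that supports indexing, so the
# one-shot filter iterator must first be turned into e.g. a list.
multiples_of_three = list(filter(lambda x: x % 3 == 0, range(1000)))
picked = random.choice(multiples_of_three)
print(picked % 3)  # always 0
```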
I ran the following test to check the speed:
import time

def test(n, seq):
    for method in (set, list, tuple):
        t = time.time()
        for i in range(n):
            method(seq)
        print(method.__name__, time.time() - t)

someFilter = filter(lambda x: x % 3 == 0, range(1000))
test(10000000, someFilter)
And the results spoke clearly in favor of using a set:
set 1.9240000247955322
list 8.82200002670288
tuple 7.031999826431274
So why is creating a set from a filter so much faster? Shouldn’t it usually take as long as creating a set from a sequence, where every element has to be hashed? Or is it somehow getting a boost from the internal filter representation?
For comparison, when running the test on a range expression, set takes about twice as long as list and tuple (which are both nearly identical in speed).
Edit:
Sven's answer is completely right, but for completeness, here is an updated test that runs on an actual filter each iteration:
import time

def testFilter(n, test, rangeSize):
    for method in (set, list, tuple):
        t = time.time()
        for i in range(n):
            method(filter(test, range(rangeSize)))
        print(method.__name__, time.time() - t)

testFilter(100000, lambda x: x % 3 == 0, 1000)
The result shows what makes more sense, with list and tuple both being the fastest, although set isn't really slower, so it makes no difference which one you use:
set 27.868000030517578
list 27.131999969482422
tuple 27.138000011444092
Answers (2)
filter() returns an iterator in Python 3, and this iterator is consumed on the first run of the inner for-loop. After that, you are only measuring the speed of the constructor -- that's why you have to repeat the call so often to make it take at least a bit of time. So it seems that the constructor of set() is the fastest one at dealing with an empty iterator.
When timings suggest results that are illogical, often it is the timing suite itself that is at fault ;-)
Try using the timeit module which can help steer you clear of common timing mistakes. In particular, you want to run a fresh setup for each test and only time the loop body rather than the body plus the test code.
In this case, it would at least have made the timings comparable (all of them would have used a fresh iterator as returned by Python 3's filter), and it would have shown unbelievably fast timings (because only the method(iterator) code would have been timed in a loop). FWIW, PyPy will be even harder to time because overly simple loops get optimized away completely.
[Answer to the question as edited] Your new timings are comparable (a good improvement), but the results still show the combination of the setup time and the loop time, making it difficult to see differences that might be significant. Your expectation should be that lists and tuples beat sets, because sets have to do more work (hashing each of the inputs rather than simply storing them).
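A hedged sketch of how such a measurement might look with timeit, so that each timed statement builds a fresh iterator (the predicate and sizes are just the ones from the question):

```python
import timeit

# Each execution of the timed statement constructs a new filter object,
# so no run ever sees an exhausted iterator; the setup runs separately
# and is not included in the reported time.
setup = "pred = lambda x: x % 3 == 0"
for ctor in ("set", "list", "tuple"):
    stmt = f"{ctor}(filter(pred, range(1000)))"
    t = timeit.timeit(stmt, setup=setup, number=10000)
    print(ctor, t)
```

Absolute numbers will vary by machine; what matters is that all three constructors are now fed identical, freshly created iterators.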