Why is collections.deque slower than collections.defaultdict?
Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies, I used a simple list instead of a defaultdict at first, but replaced it with a deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list, so I returned to using a defaultdict, which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
1 Answer
These three data structures aren't interchangeable; they serve very different purposes and have very different characteristics:

- Lists are dynamic arrays: fast random access by index and fast appends at the end, but an in test has to scan the items one by one and is O(n).
- Deques are double-ended queues: fast appends and pops at both ends, but indexing into the middle and in tests are O(n) as well.
- Dictionaries are hash tables: they map keys to values, and a key existence check hashes the key and runs in O(1) on average.
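The practical impact of these properties is easy to measure. Below is a minimal, hypothetical sketch (synthetic word list standing in for the big.txt vocabulary; absolute timings vary by machine) comparing an in test against each structure:

import collections
import timeit

# Synthetic stand-in for the big.txt vocabulary (hypothetical data).
vocab = ["word%d" % i for i in range(100000)]

as_list = list(vocab)
as_deque = collections.deque(vocab)
as_dict = dict.fromkeys(vocab, 1)  # bogus values; only the keys matter
as_set = set(vocab)

probe = "word99999"  # worst case for the sequences: sits at the very end

for name, container in [("list", as_list), ("deque", as_deque),
                        ("dict", as_dict), ("set", as_set)]:
    elapsed = timeit.timeit(lambda: probe in container, number=100)
    print("%5s: %.4f s for 100 lookups" % (name, elapsed))

On a typical machine the list and deque lines come out several orders of magnitude slower than the dict and set lines, which is exactly the effect described below.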
All of these properties are important; keep them in mind whenever you choose one over the other. What hurts you in this particular case is the combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics yields a concrete formula for the number of edits this code generates for a given word, and anyone who has mispredicted such things often enough knows that the number is surprisingly large even for average words.
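To make that concrete: Norvig's edits1 produces n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions for a word of length n, i.e. 54n+25 candidates before deduplication. A quick back-of-the-envelope check:

# Candidates generated by Norvig's edits1 for a word of length n
# (duplicates included): deletions + transpositions + replacements + insertions.
def edits1_count(n):
    return n + (n - 1) + 26 * n + 26 * (n + 1)  # = 54 * n + 25

print(edits1_count(9))  # a 9-letter word already yields 511 candidates
# edits2 applies edits1 to every result, so a single unknown word can
# trigger hundreds of thousands of 'in NWORDS' membership tests.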
For each of these edits, there is an edit in NWORDS check to weed out edits that result in unknown words. This is not a problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For a sequence, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since hardly any of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search through a sequence, where you could instead just hash a string and compare it once (or at most - in case of collisions - a few times).

If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set, which uses pretty much the same algorithms as dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
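Concretely, a set-based variant of the change from the question might look like the sketch below (it follows the function names from Norvig's original script; open is used instead of the Python-2-only file):

import re

def words(text):
    # Norvig's tokenizer: runs of lowercase letters
    return re.findall('[a-z]+', text.lower())

# Only membership matters here, so a plain set gives O(1) 'in' checks.
NWORDS = set(words(open('big.txt').read()))

def known(candidates):
    # hash lookups instead of a linear scan over a deque
    return set(w for w in candidates if w in NWORDS)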