Why is collections.deque slower than collections.defaultdict?

Posted on 2024-11-27 14:38:01 · 775 characters · 1 view · 0 comments

Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.

I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies, I used a simple list instead of a defaultdict at first, but replaced it with a deque as soon as I noticed that a single word lookup took about 25 seconds.

Surprisingly, that wasn't any faster than using a list, so I went back to the defaultdict, which returned results almost instantaneously.

Can someone explain this difference in performance to me?

Thanks in advance


PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.

-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))

-return max(candidates, key=NWORDS.get)
+return candidates
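
To see the effect in isolation, here is a rough, self-contained sketch (Python 3 syntax, with a made-up word list; not Norvig's actual script) that times a bare membership test against a deque and a defaultdict built from the same words:

    import collections
    import timeit

    # Made-up stand-in for the corpus (the real script reads big.txt);
    # only the size of the container matters here, not the actual strings.
    words = ["word%d" % i for i in range(50000)]

    as_deque = collections.deque(words)
    as_dict = collections.defaultdict(int, ((w, 1) for w in words))

    # 'in' on the deque scans element by element; 'in' on the dict hashes the key.
    print(timeit.timeit("'word49999' in as_deque", globals=globals(), number=100))
    print(timeit.timeit("'word49999' in as_dict", globals=globals(), number=100))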

Comments (1)

嘴硬脾气大 2024-12-04 14:38:01

These three data structures aren't interchangeable; they serve very different purposes and have very different characteristics (the short sketch after this list illustrates the typical operations):

  • Lists are dynamic arrays; you use them to store items sequentially for fast random access, to serve as a stack (adding and removing at the end), or just to store something and later iterate over it in the same order.
  • Deques are sequences too, but designed for adding and removing elements at both ends rather than for random access or stack-like growth.
  • Dictionaries (providing a default value is just a relatively simple and convenient, but for this question irrelevant, extension) are hash tables; they associate fully-featured keys (instead of an index) with values and provide very fast access to a value by key and, necessarily, very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
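
As a quick illustration of those characteristics (a minimal sketch, not tied to the spelling corrector):

    import collections

    # List: dynamic array - fast append at the end and fast random access by index.
    items = [10, 20, 30]
    items.append(40)
    print(items[2])              # 30

    # Deque: cheap appends and pops at BOTH ends, but no cheap random access.
    d = collections.deque([10, 20, 30])
    d.appendleft(5)
    d.pop()                      # removes 30 from the right end
    print(d)                     # deque([5, 10, 20])

    # Dict/defaultdict: hash table - average O(1) lookup and membership by key.
    freq = collections.defaultdict(int)
    freq["spam"] += 1
    print("spam" in freq)        # True, found by hashing, not by scanning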

All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is the combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics would yield a concrete formula for the number of edits this code generates for a given word, but anyone who has mispredicted such things often enough will know that it's going to be a surprisingly large number, even for average words.
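
To get a feel for the numbers, here is roughly the distance-1 candidate generation from Norvig's essay (paraphrased in Python 3; the misspelling below is just an example), with the candidate counts printed:

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away from word."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    print(len(edits1("corection")))                                      # already a few hundred candidates
    print(len({e2 for w in edits1("corection") for e2 in edits1(w)}))    # distance 2: vastly more

And every one of those candidates gets tested for membership in NWORDS.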

For each of these edits, there is a check edit in NWORDS to weed out edits that result in unknown words. That's not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For a sequence, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since few of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence, where you could just hash a string and compare it once (or at most, in case of collisions, a few times).

If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set, which uses pretty much the same algorithms as dictionaries and simply cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
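
As a minimal sketch of that swap (Python 3, with helper names that only mirror Norvig's script):

    import re

    def words(text):
        # Same idea as the script's tokenisation: runs of lowercase letters.
        return re.findall(r'[a-z]+', text.lower())

    # A set instead of a frequency dict: membership is all we need here.
    with open('big.txt') as corpus:        # assumes the same big.txt corpus file
        NWORDS = set(words(corpus.read()))

    def known(candidates):
        # Each 'in' is an average O(1) hash lookup, just like with the dict.
        return {w for w in candidates if w in NWORDS}

    print(known({"spelling", "speling", "korrect"}))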
