为什么在一个大的 std::list 上迭代这么慢?
正如标题所示,我的一个程序遇到了问题,我使用 std::list 作为堆栈,并迭代列表的所有元素。当列表变得非常大时,该程序花费的时间就太长了。
有人对此有好的解释吗?是一些堆栈/缓存行为吗?
(通过将列表更改为 std::vector 和 std::deque (顺便说一句,这是一个令人惊奇的数据结构)解决了问题,一切突然变得更快)
编辑:我不是傻瓜,我不访问元素在列表的中间。我对列表所做的唯一事情是在末尾/开头删除/添加元素,并迭代列表的所有元素。 我总是使用迭代器来迭代列表。
As title suggests, I had problems with a program of mine where I used a std::list as a stack and also to iterate over all elements of the list. The program was taking way too long when the lists became very big.
Does anyone have a good explanation for this? Is it some stack/cache behavior?
(Solved the problem by changing the lists to std::vector and std::deque (an amazing data structure by the way) and everything suddenly went so much faster)
EDIT: I'm not a fool and I don't access elements in the middle of the lists. The only thing I did with the lists was to remove/add elements at the end/beginning and to iterate through all elements of the list.
And I always used iterators to iterate over the list.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
列表的缓存位置很糟糕(不存在)。每个节点都是一个新的内存分配,并且可能位于任何地方。因此,每次当您沿着指针从一个节点到下一个节点时,您都会跳转到内存中一个新的、不相关的位置。是的,这对性能有很大影响。高速缓存未命中可能比高速缓存命中慢两个数量级。在向量或双端队列中,几乎每次访问都会命中缓存。向量是一个连续的内存块,因此对其进行迭代的速度尽可能快。双端队列是几个较小的内存块,因此它会偶尔出现缓存未命中的情况,但它们仍然很少见,并且迭代仍然非常快,因为您主要获得缓存命中。
一个列表将几乎所有缓存未命中。而且性能会很糟糕。
实际上,从性能的角度来看,链表几乎不是正确的选择。
编辑:
正如评论所指出的,列表的另一个问题是数据依赖性。现代 CPU 喜欢重叠操作。但如果下一条指令取决于这条指令的结果,它就无法做到这一点。
如果你迭代一个向量,那没问题。您可以即时计算要读取的下一个地址,而无需检查内存。如果您现在正在地址
x
处读取,则下一个元素将位于地址x + sizeof(T)
处,其中 T 是元素类型。因此,那里没有依赖关系,CPU 可以立即开始加载下一个元素或其后的元素,同时仍然处理较早的元素。这样,当我们需要数据时,数据就会为我们准备好,这进一步有助于掩盖访问 RAM 中数据的成本。在列表中,我们需要跟踪从节点
i
到节点i+1
的指针,直到加载了i+1
为止,我们甚至不知道在哪里寻找i+2
。我们存在数据依赖性,因此 CPU 被迫一次读取一个节点,并且它无法提前开始读取未来的节点,因为它还不知道它们在哪里。如果列表不全是缓存未命中,这不会是一个大问题,但由于我们遇到了大量缓存未命中,这些延迟的成本很高。
Lists have terrible (nonexistent) cache locality. Every node is a new memory allocation, and may be anywhere. So every time you follow a pointer from one node to the next, you jump to a new, unrelated, place in memory. And yes, that hurts performance quite a bit. A cache miss may be two orders of magnitudes slower than a cache hit. In a vector or deque, pretty much every access will be a cache hit. A vector is one single contiguous block of memory, so iterating over that is as fast as you're going to get. A deque is several smaller blocks of memory, so it introduces the occasional cache miss, but they'll still be rare, and iteration will still be very fast as you're getting mostly cache hits.
A list will be almost all cache misses. And performance will suck.
In practice, a linked list is hardly ever the right choice from a performance point of view.
Edit:
As a comment pointed out, another problem with lists is data dependencies. A modern CPU likes to overlap operations. But it can't do that if the next instruction depends on the result of this one.
If you're iterating over a vector, that's no problem. You can compute the next address to read on the fly, without ever having to check in memory. If you're reading at address
x
now, then the next element will be located at addressx + sizeof(T)
where T is the element type. So there are no dependencies there, and the CPU can start loading the next element, or the one after it, immediately, while still processing an earlier element. That way, the data will be ready for us when we need it, and this further helps mask the cost of accessing data in RAM.In a list, we need to follow a pointer from node
i
to nodei+1
, and untili+1
has been loaded, we don't even know where to look fori+2
. We have a data dependency, so the CPU is forced to read nodes one at a time, and it can't start reading future nodes ahead of time, because it doesn't yet know where they are.If a list hadn't been all cache misses, this wouldn't have been a big problem, but since we're getting a lot of cache misses, these delays are costly.
这是由于使用列表时会出现大量缓存未命中。对于向量,周围的元素存储在处理器缓存中。
It is due to the large amounts of cache misses you get when using a list. With a vector the surrounding elements are stored in the processors cache.
查看以下 stackoverflow 线程。
Have a look at the following stackoverflow thread.
存在一个缓存问题:向量中的所有数据都存储在连续的块中,每个列表元素都是单独分配的,并且可能恰好存储在相当随机的内存位置中,这会导致更多的缓存错过了。但是,我敢打赌您会遇到其他答案中描述的问题之一。
There is a cache issue: all data in vector are stored in a contiguous chunk, and each list element is allocated separately and may happen to be stored in quite a random place of memory, which leads to more cache misses. However, I bet that you encounter one of the issues described in the other answers.
简单的答案是,因为迭代向量根本不是迭代,它只是从数组的底部开始并一个接一个地读取元素。
我看到这被标记为 C++,而不是 C,但由于它们在幕后做同样的事情,值得指出的是,您可以通过分配任意大的数组以及 realloc()ing 和 memmove 将元素添加到数组的开头和结尾如果空间不足,则在 2 个伴随数组之间进行 () 操作。非常快。
将元素添加到数组开头的技巧是通过将指针推进到数组的开头来偏置数组的逻辑开头,然后在前面添加元素时将其后退。 (也是堆栈的实现方式)
以完全相同的方式,可以使 C 支持负下标。
C++ 使用向量 STL 类为您完成所有这些工作,但仍然值得记住幕后发生的事情。
The simple answer is because iterating over a vector isn't iterating at all, it's just starting at the base of an array and reading the elements one after another.
I see this is marked C++, not C, but since they do the same thing under the covers it's worth pointing out that you can add elements to the beginning and end of an array by allocating it arbitrarily large, and realloc()ing and memmove()ing between 2 companion arrays if and when you run out of room. Very fast.
The trick to adding elements to the beginning of an array is to bias the logical start of the array by advancing the pointer into the array at the start, and then backing it up when adding elements at the front. (also the way a stack is implemented)
In exactly the same way, C can be made to support negative subscripts.
C++ does all this for you with the vector STL class, but still worth remembering what's going on under the covers.
[编辑:我的立场是正确的。 std::list 没有运算符[]。抱歉。]
从您的描述中很难看出,但我怀疑您试图随机访问这些项目(即通过索引):
而不是使用迭代器:
“向量”和“矢量”都可以。 “deque”擅长随机访问,因此对于这些类型来说,任何一个都可以充分执行——在两种情况下都是 O(1)。但“列表”不擅长随机访问。通过索引访问列表将花费 O(n^2) 时间,而使用迭代器时则需要 O(1) 时间。
[Edit: I stand corrected. std::list doesn't have operator[]. Sorry.]
It's hard to tell from your description, but I suspect you were trying to access the items randomly (i.e., by index):
Instead of using the iterators:
Both "vector" & "deque" are good at random access, so either will perform adequately for those types---O(1) in both cases. But "list" is not good at random access. Accessing the list by index would take O(n^2) time, versus O(1) when using iterators.