Should I randomly shuffle before inserting into an STL set?
I need to insert 10 million strings into a C++ STL set. The strings are already sorted. Will I hit a pathological case if I insert them in sorted order? Should I randomize first? Or will the G++ STL implementation automatically rebalance for me?
Comments (7)
The set implementation typically uses a red-black tree, which will rebalance for you. However, insertion may be faster (or it may not) if you randomise the data before inserting - the only way to be sure is to do a test with your set implementation and specific data. Retrieval times will be the same, either way.
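A rough way to run that test is sketched below. The helper names (make_sorted_keys, shuffled_copy, time_inserts_ms) and the key format are made up for illustration; real data and a real compiler setup may behave differently, so treat the numbers it produces as indicative only.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical test data: `count` distinct strings in ascending order.
// Zero-padding makes lexicographic order match numeric order.
std::vector<std::string> make_sorted_keys(std::size_t count) {
    std::vector<std::string> keys;
    keys.reserve(count);
    char buf[32];
    for (std::size_t i = 0; i < count; ++i) {
        std::snprintf(buf, sizeof buf, "key%010zu", i);
        keys.emplace_back(buf);
    }
    return keys;
}

// Return a copy of the keys in a pseudo-random order (fixed seed, so runs repeat).
std::vector<std::string> shuffled_copy(std::vector<std::string> keys, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(keys.begin(), keys.end(), rng);
    return keys;
}

// Insert every key one by one into `out` and return the elapsed milliseconds.
double time_inserts_ms(const std::vector<std::string>& keys,
                       std::set<std::string>& out) {
    auto start = std::chrono::steady_clock::now();
    for (const auto& k : keys) out.insert(k);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_inserts_ms once on make_sorted_keys(n) and once on shuffled_copy of the same keys, each into a fresh set, gives the two timings to compare. Build with optimizations on and repeat the runs a few times, since a single measurement is noisy.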
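The implementation will re-balance automatically. Given that you know the input is sorted, however, you can give it a bit of assistance: You can supply a "hint" when you do an insertion, and in this case supplying the iterator to the previously inserted item will be exactly the right hint to supply for the next insertion. (That matches the C++03 wording, where the fast case is an element inserted right after the hint. Since C++11 the guarantee applies when the element is inserted right before the hint, so for ascending input the hint to pass is end().) In this case, each insertion will have amortized constant complexity instead of the logarithmic complexity you'd otherwise expect.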
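A minimal sketch of hinted insertion for already-sorted input follows. The function name build_from_sorted is made up for illustration. It passes end() as the hint, because since C++11 the standard's amortized-constant guarantee applies when the new element goes immediately before the hint, and with ascending input every new key belongs at the end.

```cpp
#include <set>
#include <string>
#include <vector>

// Build a set from keys that are already in ascending order.
// Each new key belongs at the very end, so end() is the hint that lets
// the tree skip the usual top-down search.
std::set<std::string> build_from_sorted(const std::vector<std::string>& sorted_keys) {
    std::set<std::string> result;
    for (const auto& key : sorted_keys) {
        // A wrong hint only costs time; the set stays correct either way.
        result.insert(result.end(), key);
    }
    return result;
}
```

The range constructor, std::set<std::string>(keys.begin(), keys.end()), is another option: the standard specifies it as linear in the number of elements when the range is already sorted.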
The only question I have: do you really need a set?
If the data is already sorted and you don't need to insert / delete elements after the creation, a deque would be better:
On binary_search: I suspect you need more than a ForwardIterator for a binary search, guess this site is off again :(
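To illustrate the sorted-container idea, here is a small sketch that keeps the strings in a std::deque and looks them up with std::binary_search. The function name contains_sorted is hypothetical, and the sketch assumes the deque is sorted ascending and not modified afterwards. std::binary_search itself only requires forward iterators, but with random-access iterators such as a deque's the iterator movement is logarithmic as well as the comparison count.

```cpp
#include <algorithm>
#include <deque>
#include <string>

// Membership test on a deque whose contents are already sorted ascending.
// If the deque is not sorted, the result is meaningless.
bool contains_sorted(const std::deque<std::string>& sorted_items,
                     const std::string& key) {
    return std::binary_search(sorted_items.begin(), sorted_items.end(), key);
}
```

If the position of the match is needed rather than a yes/no answer, std::lower_bound takes the same arguments and returns an iterator.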
http://en.wikipedia.org/wiki/Standard_Template_Library
set: "Implemented using a self-balancing binary search tree."
g++'s libstdc++ uses red-black trees for sets and maps.
http://en.wikipedia.org/wiki/Red-black_tree
This is a self-balancing tree, so each insertion costs O(log n) regardless of the order the elements arrive in. The C++ standard also requires that all implementations have this characteristic, so in practice, they are almost always red-black trees, or something very similar.
So don't worry about the order you put the elements in.
A very cheap and simple solution is to insert from both ends of your collection of strings. That is to say, first add "A", then "ZZZZZ", then "AA", then "ZZZZY", and so on until you meet in the middle. It doesn't require the hefty cost of shuffling, yet it is likely to sidestep pathological cases.
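The alternating order can be sketched with two indices walking toward each other. The function name insert_from_both_ends is made up for illustration, and the sketch assumes the input vector is already sorted. Whether this is any faster than plain in-order insertion on a self-balancing tree is something only a measurement would show.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Insert keys alternately from the front and the back of a sorted vector:
// smallest, largest, second smallest, second largest, and so on.
std::set<std::string> insert_from_both_ends(const std::vector<std::string>& sorted_keys) {
    std::set<std::string> result;
    std::size_t lo = 0;                   // next unused index from the front
    std::size_t hi = sorted_keys.size();  // one past the next unused index from the back
    while (lo < hi) {
        result.insert(sorted_keys[lo++]);               // next smallest key
        if (lo < hi) result.insert(sorted_keys[--hi]);  // next largest key
    }
    return result;
}
```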
Maybe 'unordered_set' can be an alternative.
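A minimal sketch of that alternative, assuming C++11 or later and that sorted iteration over the strings is not needed (std::unordered_set does not keep its elements in any particular order). The function name build_hash_set is made up for illustration.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Build a hash set from the keys; their input order does not matter here.
std::unordered_set<std::string> build_hash_set(const std::vector<std::string>& keys) {
    std::unordered_set<std::string> result;
    result.reserve(keys.size());  // size the bucket array up front to avoid rehashing
    result.insert(keys.begin(), keys.end());
    return result;
}
```

Lookups with find or count are then constant time on average, at the cost of hashing each string.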