不可变数据结构性能
我不明白作为一个集合的东西怎么可能是不可变的并且仍然具有可接受的性能。
根据我在 F# Sets 中读到的内容,内部使用红黑树作为其实现。如果每次我们想要向红黑树添加新内容时,我们基本上都必须重新创建它,那么它如何才能具有良好的性能呢?我在这里缺少什么?
尽管我要求 F# 的集合这样做,但我认为这与具有或使用不可变数据结构的任何其他语言一样相关。
谢谢
I don't get how can something as a Set be immutable and still have an acceptable performance.
From what I've read in F# Sets internally use Red Black Trees as their implementation. If each time we want to add something new to a Red Black Tree we have to basically recreate it, how can it have ever good performance? What am I missing here?
Although I am asking this for F#'s Sets, I think this is as relevant in any other language which has or uses immutable data structures.
Thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
很简单,集合是基于节点的存储实体。对于集合,您可以将其实现为一棵树,其中当您将元素“添加”到集合的下一个版本时,您不会重新创建所有边和节点,而只是创建一组新的边。您可以这样做,因为节点本身永远不会改变,其中包含的对象也不会改变。
它的真正好处是在单线程应用程序中而不是在多线程应用程序中。不可变的数据结构消除了对锁定机制的需要。如果它们永远不会改变,你就不必担心状态。
Quite simply a Set is a node based storage entity. In the case of a Set you can implement it as a tree wherein you are not recreating all the edges and the nodes when you "add" an element to the next version of the Set, instead you're just creating a new set of edges. You can do this because the nodes themselves will never change, nor will the objects held within them.
The real benefit it's found within single threaded applications but rather in multi-threaded applications. Immutable data structures remove the need for locking mechanisms. If they're never going to change, you don't have to worry about state.
不确定这是如何在语言中实现的,但是数据结构对于程序员来说是不可变的,但可以在幕后进行优化。
例如,我有一个列表 a=[1,2,3,4,5]。我附加 6.b=[a [6]] 并且它们都可以是不可变的。这样做不会损失任何性能,而且比复制值更快。
那么,让我问你,因为我不知道,为什么做不可变的事情会更慢?就树而言,我有点明白你的观点。我猜你必须在当前节点上方重新创建节点,但不能在下方重新创建节点(假设我们有子指针而不是父指针)。
not sure how this is implemented in the language, but the data structures could be perceived to be immutable to the programmer, but be optimized behind the scenes.
for instance, I have a list a=[1,2,3,4,5]. I append 6. b=[a [6]] and they can both be immutable. You don't lose any performance by doing this, and it's faster than copying the values.
So, let me ask you, because I don't know, why would it be slower to do things as immutable? In the case of the tree, I kind of see your point. You'd have to recreate nodes above the current node I guess, but not below (assuming we have children pointers and not parent pointers).
几乎所有不可变集合都是某种形式的平衡树。要创建新树,您必须重新分配从更改(插入、删除、“更新”)到根的路径上的节点。只要树是平衡的,这就会花费对数时间。如果您有类似 2-3-4 树(类似于红黑树)的预期出度为 3 的树,则只需使用 10 次分配即可处理一百万个元素。
在数据结构被期望是纯粹的语言中,它们确保分配速度很快。分配一个四元素节点将花费一次比较、一次增量和四次存储。在许多情况下,您可以分摊多个分配的比较成本。
如果您想了解有关这些结构如何工作的更多信息,一个很好的来源是 纯函数式数据结构,作者:Chris Okasaki。
Almost all immutable collections are some form of balanced tree. To create a new tree, you have to reallocate nodes on the path from the change (insert, remove, "update") to the root. As long as the tree is balanced this takes logarithmic time. If you have something like a 2-3-4 tree (similar to red-black trees) with expected outdegree three, you can handle a million elements using only 10 allocations.
And in languages where data structures are expected to be pure, they make sure allocation is fast. Allocating a four-element node is going to cost a compare, an increment, and four stores. And in many cases you can amortize the cost of a compare over several allocations.
If you want to know more about how these structures work, an excellent source is Purely Functional Data Structures by Chris Okasaki.
您不必重新创建整棵树。许多分支将保持不变并且可以“重复使用”。举一个简单的例子,如果需要将新节点添加到当前树中的叶子中,则只需克隆该节点的父节点并赋予新分支。
You do not have to recreate the whole tree. Many of the branches will stay the same and can be 'reused'. As a simple example, if the new node needs to be added to a leaf in the current tree, then only the parents of that node needs to be cloned and given new branches.
正如其他人指出的那样,您不必重新创建整个数据结构。您只需重新创建已更改的部分并引用保持不变的现有子树。由于数据结构的不变性,您可以重用子树,因此几乎不需要复制所有内容。事实上,如果您很少需要克隆可变数据结构,它可能会产生更大的影响。
特别是,对于平衡树(例如红黑树),这将为您提供:
当然,这对于某些应用程序来说可能开销太大,但实际上并没有那么糟糕。此外,.NET 垃圾收集器中的分配速度非常快(我认为本质上是O(1)),因此这并不是真正的问题。更多的分配意味着 GC 需要更频繁地运行,但这也并不像听起来那么重要 - 如今计算机拥有相当多的内存。 .NET 4.0 实际上在许多情况下都有帮助(另请参阅 Jon Harrop 的此处的答案 )
As others pointed out, you don't have to re-create the whole data structure. You just have to re-create parts that have changed and reference existing sub-trees that stayed the same. Thanks to the immutability of the data structure, you can reuse sub-trees, so copying everything is almost never needed. In fact, if you needed to clone a mutable data structure rarely, it could have much higher impact.
In particular, for a ballanced trees (such as red-black trees) this gives you:
This may be - of course - too much overhead for some applications, but it actually isn't all that bad. Moreover, allocation in .NET garbage collector is very fast (I think, essentially O(1)), so this isn't really a problem. More allocation means that GC needs to run more frequently, but this also isn't as critical as it may sound - computers have quite a lot of memory these days. The .NET 4.0 actually helps in many cases (see also Jon Harrop's answer here)
正如其他人所说,不可变的数据结构不必完全重新创建,因为它可以重用自身的旧部分。您可以这样做,因为旧部分是不可变的,并且保证数据不会更改。
我有一个现实世界中不可变性能的例子。我用不可变的红黑树做了一些测试 我用 F# 编写的,它的执行速度只比 C++ 中的 std::sort 慢 3 倍。考虑到它不是专门为排序而设计的,我认为这确实很快。
As others have stated an immutable data structure doesn't have to be completely recreated since it can reuse old parts of itself. You can do this because the old parts are immutable and the data is guaranteed not to change.
I have a real world example of immutable performance. I did some testing with an immutable Red Black tree I made in F# and it only performs 3 times slower than std::sort in c++. Which I think is really fast considering it wasn't designed specifically for sorting.
语言语义的限制仅适用于该语言的源代码。实现(编译器、解释器、运行时环境等)可以自由地执行任何操作以获得最佳性能,只要它保持相同的行为即可。对于大多数语言来说都是如此。
编辑:
可以进行一些优化,包括数据共享(正是因为数据是不可变的)、在幕后使用可变性、优化尾部调用(因为 FP 使用大量递归)等等。
The limitations of language semantics only applies to the source code in the language. The implementation (compiler, interpreter, runtime environment, etc.) is free to do whatever it wants for the best performance as long as it keeps the same behavior. This is true for most languages.
Edit:
Several optimizations can be made including data sharing (precisely because the data are immutable), using mutability behind the scenes, optimizing tail calls (since FP uses a lot of recursion), and others.
请参阅
函数式编程:不可变的数据结构效率
(特别是我的答案,它指向Rich Hickey 的演讲)提供了“一般”令人信服的证据,证明不可变结构也可以非常高效。
至于在 F#
Set
的具体情况下这是否正确,嗯,也许今天只是中等程度。使用更高效的底层结构会很棒(从实用的角度来看;从理论上来说,当然一切都是 O(logN) (实际上是 O(1)))。See
functional programming: immutable data structure efficiency
(especially my answer that points to Rich Hickey's talk) for the 'general' convincing evidence that yes, immutable structures can also be very efficient.
As to how well this is true in the specific case of F#
Set
, well, perhaps only moderately so today. It would be great to use a more efficient underlying structure (in pragmatic terms; in theoretical terms, of course everything is O(logN) (which in practical terms is O(1))).