Best practices for cache locality in multicore parallelism in F#
I'm studying multicore parallelism in F#. I have to admit that immutability really helps with writing correct parallel implementations. However, it's hard to achieve good speedup and good scalability as the number of cores grows. For example, in my experience with the quicksort algorithm, many attempts to implement parallel quicksort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to the sequential versions. However, if one implements parallel quicksort using mutation inside arrays, a good speedup can be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle to multicore parallelism in a functional language. Functional programming involves creating many short-lived objects; the destruction of those objects may hurt the coherence of CPU caches. I have seen many suggestions on how to improve cache locality in imperative languages, for example here and here. But it's not clear to me how they would be applied in functional programming, especially with recursive data structures such as trees, which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.
6 Answers
As far as I can make out, the key to cache locality (multithreaded or otherwise) is to keep your working data in a contiguous block of memory that fits in the cache.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right; computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying.
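To illustrate the "imperfect but contiguous" trade-off, here is a small F# sketch contrasting a list of heap-allocated records with a flat array of struct records; the types and values are invented for illustration:

```fsharp
// Two representations of 2-D points. A list of reference-typed
// records scatters nodes across the heap; a struct array keeps all
// coordinates in one contiguous block, so a linear scan walks
// memory sequentially and stays in cache.
type PointRef = { X: float; Y: float }

[<Struct>]
type PointVal = { PX: float; PY: float }

// Pointer-chasing sum: every cons cell and record is a separate allocation.
let sumRef (pts: PointRef list) =
    pts |> List.sumBy (fun p -> p.X + p.Y)

// Contiguous sum: one allocation, sequential access pattern.
let sumVal (pts: PointVal[]) =
    let mutable acc = 0.0
    for i in 0 .. pts.Length - 1 do
        acc <- acc + pts.[i].PX + pts.[i].PY
    acc
```

Both functions compute the same result; the struct-array version trades the elegance of the list for a memory layout the prefetcher can exploit.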
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. A purely functional style often yields a more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion focused on performance: choose a processor, then benchmark it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in the general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.
I am no parallelism expert, but here is my advice anyway.
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache-locality improvements on a single core before going parallel; after that, it should simply be a matter of subdividing the work again for each core. For example, if you are doing matrix calculations with large matrices, you could split the computation into smaller sections.
Here's a great example of that: Cache Locality For Performance
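As a rough sketch of the subdividing idea, here is one way to sum a large matrix in F# by handing each core a contiguous band of rows (the band size of 64 is an assumed, untuned value):

```fsharp
open System.Threading.Tasks

// Sum of a large matrix, split into horizontal bands so each core
// iterates over a contiguous slice of rows.
let bandedSum (m: float[,]) =
    let rows = Array2D.length1 m
    let cols = Array2D.length2 m
    let bandSize = 64
    let bands = (rows + bandSize - 1) / bandSize
    let partials : float[] = Array.zeroCreate bands
    Parallel.For(0, bands, fun b ->
        let mutable acc = 0.0
        // Each band covers rows [b*bandSize, min rows ((b+1)*bandSize)).
        for i in b * bandSize .. min rows (b * bandSize + bandSize) - 1 do
            for j in 0 .. cols - 1 do
                acc <- acc + m.[i, j]
        partials.[b] <- acc) |> ignore
    Array.sum partials
```

Each task writes only to its own slot of `partials`, so there is no shared mutable state to contend over.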
There are some great sections in Tomas Petricek's book Real-World Functional Programming; check out Chapter 14, Writing Parallel Functional Programs. You might find "Parallel processing of a binary tree" of particular interest.
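A minimal sketch in the spirit of depth-bounded parallel tree processing (this is an illustration of the general technique, not the book's code; the depth cutoff is an assumption):

```fsharp
open System.Threading.Tasks

type Tree<'a> =
    | Leaf of 'a
    | Node of Tree<'a> * Tree<'a>

// Sum the leaves of a tree. Branches within 'depth' levels of the
// root are forked as tasks; deeper subtrees run sequentially, so
// task overhead stays proportional to the core count, not tree size.
let rec sumTree depth tree =
    match tree with
    | Leaf v -> v
    | Node (l, r) when depth > 0 ->
        let left = Task.Run(fun () -> sumTree (depth - 1) l)
        let right = sumTree (depth - 1) r
        left.Result + right
    | Node (l, r) ->
        sumTree 0 l + sumTree 0 r
```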
To write scalable apps, cache locality is paramount to your application's speed. The principles are well explained in Scott Meyers' talk. Immutability does not play well with cache locality, since creating new objects in memory forces the CPU to reload the data from the new object again.
As noted in the talk, even on modern CPUs the L1 cache is only 32 KB per core, split between code and data. If you go multithreaded, you should try to consume as little memory as possible (goodbye immutability) to stay in the fastest cache. The L2 cache is about 4-8 MB, which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application which consumes as little memory as possible (data cache locality), you can get speedups of 20 or more. But even if you manage this for one core, it might very well be that scaling to more cores will hurt performance, since all cores are competing for the same L2 cache.
To get the most out of it, the C++ guys use PGO (Profile-Guided Optimization), which allows them to profile their application; the profile is then used as input data for the compiler to emit better-optimized code for the specific use case.
You can get better to a certain extent in managed code, but since so many factors influence your cache locality, it is not likely that you will ever see a speedup of 20 in the real world from cache locality alone. This remains the domain of C++ and compilers that use profiling data.
You may get some ideas from these:
Cache-Oblivious Search Trees Project: http://supertech.csail.mit.edu/cacheObliviousBTree.html
DSpace@MIT, Cache coherence strategies in a many-core processor: http://dspace.mit.edu/handle/1721.1/61276
An article describing the revolutionary idea of cache-oblivious algorithms via an elegant and efficient implementation of a matrix multiply in F#.
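As a rough illustration of the cache-oblivious idea (not the article's actual code), a recursive matrix multiply might be sketched in F# like this, assuming square matrices with power-of-two size; the base-case size of 16 is an arbitrary cutoff:

```fsharp
// Cache-oblivious matrix multiply: recursively split the problem
// into quadrant multiplications until the blocks are small enough
// to fit in cache at some level, then fall back to the naive loop.
let cacheObliviousMultiply (a: float[,]) (b: float[,]) =
    let n = Array2D.length1 a
    let c = Array2D.zeroCreate n n
    // Multiply-accumulate the n0 x n0 blocks of a and b starting at
    // (ai, aj) and (bi, bj) into the block of c starting at (ci, cj).
    let rec go ai aj bi bj ci cj n0 =
        if n0 <= 16 then
            for i in 0 .. n0 - 1 do
                for k in 0 .. n0 - 1 do
                    for j in 0 .. n0 - 1 do
                        c.[ci + i, cj + j] <-
                            c.[ci + i, cj + j] + a.[ai + i, aj + k] * b.[bi + k, bj + j]
        else
            let h = n0 / 2
            // C11 += A11*B11 + A12*B21, and likewise for each quadrant.
            go ai aj bi bj ci cj h
            go ai (aj + h) (bi + h) bj ci cj h
            go ai aj bi (bj + h) ci (cj + h) h
            go ai (aj + h) (bi + h) (bj + h) ci (cj + h) h
            go (ai + h) aj bi bj (ci + h) cj h
            go (ai + h) (aj + h) (bi + h) bj (ci + h) cj h
            go (ai + h) aj bi (bj + h) (ci + h) (cj + h) h
            go (ai + h) (aj + h) (bi + h) (bj + h) (ci + h) (cj + h) h
    go 0 0 0 0 0 0 n
    c
```

Because the recursion halves the problem without ever referring to cache sizes, it adapts to every level of the memory hierarchy at once, which is what makes the approach "oblivious".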