Best practices for cache locality in multicore parallelism with F#


I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementations. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with the Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to the sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup can be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
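For concreteness, here is a minimal sketch of the kind of mutating implementation I mean — illustrative only, not tuned; the helper names (`swap`, `partition`, `sortRange`) and the task-depth heuristic are just placeholder choices:

```fsharp
open System
open System.Threading.Tasks

// Swap two elements of an array in place.
let inline swap (a: 'T[]) i j =
    let tmp = a.[i]
    a.[i] <- a.[j]
    a.[j] <- tmp

// Lomuto partition around the last element; returns the pivot's final index.
let partition (a: 'T[]) lo hi =
    let pivot = a.[hi]
    let mutable i = lo
    for j in lo .. hi - 1 do
        if a.[j] < pivot then
            swap a i j
            i <- i + 1
    swap a i hi
    i

// Sort a.[lo..hi] in place, forking tasks until the depth budget runs out.
let rec sortRange (a: 'T[]) lo hi depth =
    if lo < hi then
        let p = partition a lo hi
        if depth > 0 then
            // The two halves touch disjoint slices of the same array,
            // so they can run in parallel without synchronization.
            Parallel.Invoke(
                (fun () -> sortRange a lo (p - 1) (depth - 1)),
                (fun () -> sortRange a (p + 1) hi (depth - 1)))
        else
            sortRange a lo (p - 1) 0
            sortRange a (p + 1) hi 0

// Fork roughly one task per core, then fall back to sequential recursion.
let parallelQuicksort (a: 'T[]) =
    sortRange a 0 (a.Length - 1) (int (log (float Environment.ProcessorCount) / log 2.0) + 1)
```

A real implementation would also fall back to a sequential sort below some slice length; this sketch only caps the fork depth. The point is that each task scans a contiguous region of the shared array rather than allocating fresh lists.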

I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves creating many short-lived objects; the destruction of those objects may degrade the coherence properties of CPU caches. I have seen many suggestions on how to improve cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be done in functional programming, especially with recursive data structures such as trees, which appear quite often.

Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.


纵情客 2024-11-17 18:51:31


As far as I can make out, the key to cache locality (multithreaded or otherwise) is

  • Keep work units in a contiguous block of RAM that will fit into the cache

To this end:

  • Avoid objects where possible
    • Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
    • You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
  • Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
    • Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
  • Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
  • Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated like primitives by the .NET compilers (see the sketch after this list).
  • Work out the size of the cache on the machine you'll be executing it on
    • CPUs have different size L2 caches
    • It might be prudent to design your code to scale with different cache sizes
    • Or more simply, write code that will fit inside the lowest common cache size your code will be running on
  • Work out what needs to sit close to each datum
    • In practice, you're not going to fit your whole working set into the L2 cache
    • Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
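A minimal sketch of the arrays-of-structs point; the type names here are my own illustration, not anything prescribed:

```fsharp
// A struct: its fields live inline in the array, so an array of these
// is one contiguous block of memory.
[<Struct>]
type PointStruct =
    val X: float
    val Y: float
    new (x, y) = { X = x; Y = y }

// A class: the array holds only references; each instance lives wherever
// the allocator (and later the GC) happens to place it on the heap.
type PointClass(x: float, y: float) =
    member this.X = x
    member this.Y = y

// Contiguous: iterating this touches memory sequentially,
// which the hardware prefetcher handles well.
let structs = Array.init 1000000 (fun i -> PointStruct(float i, float i))

// Pointer-chasing: iterating this dereferences a million
// potentially scattered objects.
let classes = Array.init 1000000 (fun i -> PointClass(float i, float i))
```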

In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.

A good academic paper on the subject is Cache-Efficient String Sorting Using Copying

风启觞 2024-11-17 18:51:31


Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. Purely functional style often yields a more intuitive implementation, and hence is preferred.

Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance concrete: choose a processor, then benchmark it with a specific algorithm.

To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in the general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.

抱猫软卧 2024-11-17 18:51:31


I am no parallelism expert, but here is my advice anyway.

  1. I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
  2. Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replacing references with indices before processing (see the sketch below). Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
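A hedged sketch of point 2 for a binary tree; the `FlatNode` layout and the names are my own illustration, not the answerer's code:

```fsharp
// A conventional pointer-based tree: each node is a separate heap object.
type Tree =
    | Leaf
    | Node of int * Tree * Tree

// Flattened form: all nodes live in one array; children are indices, -1 = none.
[<Struct>]
type FlatNode = { Value: int; Left: int; Right: int }

// Copy the tree into an array in preorder, returning the node array.
let flatten (root: Tree) : FlatNode[] =
    let nodes = ResizeArray<FlatNode>()
    let rec go t =
        match t with
        | Leaf -> -1
        | Node (v, l, r) ->
            nodes.Add { Value = v; Left = -1; Right = -1 }  // reserve this node's slot
            let me = nodes.Count - 1
            let li = go l
            let ri = go r
            nodes.[me] <- { Value = v; Left = li; Right = ri }
            me
    go root |> ignore
    nodes.ToArray()

// Traversal over the flat form reads a contiguous array instead of
// dereferencing scattered heap objects.
let rec sumFlat (nodes: FlatNode[]) i =
    if i < 0 then 0
    else nodes.[i].Value + sumFlat nodes nodes.[i].Left + sumFlat nodes nodes.[i].Right
```

Laid out in preorder, each subtree occupies a contiguous run of the array, so a depth-first traversal walks memory mostly forward instead of chasing pointers.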
瀞厅☆埖开 2024-11-17 18:51:31


A great approach is to split the work into smaller sections and iterate over each section on each core.

One option I would start with is to look for cache locality improvements on a single core before going parallel; after that, it should simply be a matter of subdividing the work again for each core. For example, if you are doing matrix calculations with large matrices, you could split the calculations into smaller sections.

Here's a great example of that: Cache Locality For Performance
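As a rough illustration of the blocking idea (my own sketch, not code from the linked article), a tiled matrix multiply works on blockSize × blockSize tiles so the active pieces of each matrix stay resident in cache while they are reused:

```fsharp
// Naive triple loop: the inner access b.[k, j] jumps a whole row ahead
// on every k, which thrashes the cache for large matrices.
let multiplyNaive (a: float[,]) (b: float[,]) =
    let n = Array2D.length1 a
    let c = Array2D.zeroCreate n n
    for i in 0 .. n - 1 do
        for j in 0 .. n - 1 do
            let mutable sum = 0.0
            for k in 0 .. n - 1 do
                sum <- sum + a.[i, k] * b.[k, j]
            c.[i, j] <- sum
    c

// Tiled version: accumulate into c one tile at a time, so the active
// tiles of a, b and c fit in cache while they are being reused.
let multiplyTiled blockSize (a: float[,]) (b: float[,]) =
    let n = Array2D.length1 a
    let c = Array2D.zeroCreate n n
    for ii in 0 .. blockSize .. n - 1 do
        for kk in 0 .. blockSize .. n - 1 do
            for jj in 0 .. blockSize .. n - 1 do
                for i in ii .. (min (ii + blockSize) n) - 1 do
                    for k in kk .. (min (kk + blockSize) n) - 1 do
                        let aik = a.[i, k]
                        for j in jj .. (min (jj + blockSize) n) - 1 do
                            c.[i, j] <- c.[i, j] + aik * b.[k, j]
    c
```

Each block row of c is independent of the others, so the outer ii loop is also a natural seam for handing tiles to different cores, e.g. with Parallel.For.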

There are some great sections in Tomas Petricek's book Real-World Functional Programming; check out Chapter 14, Writing Parallel Functional Programs. You might find the parallel processing of a binary tree of particular interest.

别闹i 2024-11-17 18:51:31


To write scalable apps, cache locality is paramount to your application's speed. The principles are well explained in Scott Meyers' talk. Immutability does not play well with cache locality, since you create new objects in memory, which forces the CPU to reload the data from the new object again.

As noted in the talk, even on modern CPUs the L1 cache is only around 32 KB per core, shared between code and data. If you go multi-threaded, you should try to consume as little memory as possible (goodbye immutability) to stay in the fastest cache. The L2 cache is about 4-8 MB, which is much bigger but still tiny compared to the data you are trying to sort.

If you manage to write an application which consumes as little memory as possible (data cache locality), you can get speedups of 20 or more. But if you manage this for 1 core, it may very well be that scaling to more cores will hurt performance, since all cores are competing for the same L2 cache.

To get the most out of it, the C++ guys use PGO (profile-guided optimization), which lets them profile their application and feed that profile to the compiler as input, so it emits better-optimized code for the specific use case.

You can get better to a certain extent in managed code, but since so many factors influence your cache locality, it is not likely that you will ever see a speedup of 20 in the real world from cache locality alone. This remains the domain of C++ and compilers that use profiling data.
