Parallel Programming and C++

Posted 2024-07-08 10:02:01

I've been writing a lot recently about parallel computing and programming, and I do notice that a lot of patterns come up when it comes to parallel computing. Noting that Microsoft has already released a library (named the Parallel Patterns Library) along with the Microsoft Visual C++ 2010 Community Technology Preview, I'm wondering: what are the common parallel programming patterns you have been using and encountering that may be worth remembering? Do you have any idioms you follow, or patterns that seem to keep popping up, as you write parallel programs in C++?

Comments (4)

墨洒年华 2024-07-15 10:02:01

Patterns:

  • Producer/Consumer

    • One thread produces data
    • One thread consumes the data
  • Loop parallelism

    • If you can show that each iteration is independent,
      each iteration can be done in a separate thread
  • Re-Draw Thread

    • Other threads do work and update data structures, but one thread re-draws the screen.
  • Main-Event Thread

    • Multiple threads can be generating events
    • One thread has to process the events (as order is important)
    • You should try to separate the Event Thread from the Re-Draw Thread:
      this helps prevent the UI from freezing,
      but may cause excessive re-draws if not done carefully.
  • Work Group (see the sketch after this list)

    • A set of threads waits for jobs on a queue.
    • Each thread extracts one work item from the queue (waiting if none is available),
      works on that item until it is complete,
      and once finished returns to the queue.
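
A minimal sketch of the Work Group pattern in standard C++, which with a single producing thread is also Producer/Consumer. The queue of std::function jobs, the pool size of 4, and the shutdown flag are illustrative choices, not from any particular library:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::queue<std::function<void()>> jobs;   // the shared work queue
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&] {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(m);
                    cv.wait(lock, [&] { return done || !jobs.empty(); });
                    if (done && jobs.empty()) return;   // no more work: exit
                    job = std::move(jobs.front());      // extract one work item
                    jobs.pop();
                }
                job();   // work on the item outside the lock, then loop back
            }
        });

    {   // the producer side: push work items under the lock
        std::lock_guard<std::mutex> lock(m);
        for (int i = 0; i < 100; ++i)
            jobs.push([i] { /* ...process work item i... */ });
    }
    cv.notify_all();

    {   // signal shutdown once everything is queued
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (auto& w : workers) w.join();
}
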
孤者何惧 2024-07-15 10:02:01

First you have to choose between shared-memory computing and shared-nothing computing. Shared memory is easier, but doesn't scale that well; you will use shared-nothing if you either

a) have a cluster, rather than a multiprocessor system, or

b) have many CPUs (say, > 60) and a high degree of non-uniform memory access (NUMA)

For shared-memory, the common solution is to use threads; they are easy to understand as a concept, and easy to use in the API (but difficult to debug).

For shared-nothing, you use some kind of messaging. In high-performance computing, MPI is established as the messaging middleware.

You then also need to design an architecture for the parallel activities. The most common approach (again, because it's easy to understand) is the farmer-worker pattern (a.k.a. master-slave).
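
To make the farmer-worker pattern concrete, here is a minimal MPI sketch. It assumes an MPI implementation such as Open MPI or MPICH (compile with mpicxx, run with mpirun); the round-robin dispatch and the -1 stop sentinel are illustrative choices:

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                 // the farmer (master)
        for (int job = 0; job < 100; ++job) {
            int worker = 1 + job % (size - 1);       // naive round-robin dispatch
            MPI_Send(&job, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
        }
        int stop = -1;                               // sentinel: tell workers to quit
        for (int w = 1; w < size; ++w)
            MPI_Send(&stop, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
    } else {                                         // the workers
        for (;;) {
            int job;
            MPI_Recv(&job, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (job < 0) break;
            // ...process the job, possibly MPI_Send a result back to rank 0...
        }
    }
    MPI_Finalize();
}

A real farmer would usually hand out the next job as each worker reports back, rather than dispatching everything round-robin up front, so that fast workers stay busy.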

深居我梦 2024-07-15 10:02:01

Parallel Execution Patterns

Structured parallel programming with deterministic patterns is a high-level approach based mainly on a collection of recurrent parallel execution patterns, often referred to as algorithmic skeletons or parallel constructs, which abstract the program description and hide low-level multithreading details and many of the complexities inherent in parallelism from the programmer.

These reusable patterns automate many parallel-paradigm routines such as synchronization, communication, data partitioning or task scheduling, and handle them internally. This high-level approach aims to replace the traditional low-level thread-and-lock model with more abstraction and an easier way to express parallelism, focusing on productivity and programmability rather than performance.

There are many commonly used patterns, such as Map-Reduce, Fork-Join, Pipeline, or Parallel Loop...
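
As an illustration (in standard C++ rather than XPU, whose API is not shown here), a minimal Fork-Join sketch: fork a reduction into independent chunks with std::async, then join on the futures:

#include <future>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> data(1000000, 1);
    const std::size_t chunks = 4;                    // illustrative chunk count
    const std::size_t step = data.size() / chunks;

    std::vector<std::future<long long>> parts;
    for (std::size_t i = 0; i < chunks; ++i) {       // fork: one task per chunk
        auto first = data.begin() + i * step;
        auto last  = (i == chunks - 1) ? data.end() : first + step;
        parts.push_back(std::async(std::launch::async,
            [first, last] { return std::accumulate(first, last, 0LL); }));
    }
    long long total = 0;
    for (auto& f : parts) total += f.get();          // join: combine the results
    std::cout << total << '\n';
}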

Papers

"Structured Parallel Programming with Deterministic Patterns" is a paper which discuss these patterns. You can see also "MHPM: Multi-Scale Hybrid Programming Model: A Flexible Parallelization Methodology" which describe a C++ implementation of this approach named XPU.

Library

XPU is a task-based C++ library composed of a set of reusable execution patterns. It allows several types of parallelism to be expressed at several levels of granularity inside a single homogeneous programming model. It's easy to use and illustrates the interest in using patterns to design parallel programs.

For example, it allows the expression of:

  1. Task Parallelism Pattern:

    Simple or Hierarchical Fork/Join execution pattern with some features such
    as automatic detection and protection of shared data.

  2. Data Parallelism Pattern:

    Parallel loop pattern with scalable data partitioning.

  3. Temporal Parallelism Pattern:

    Pipeline execution pattern.

久隐师 2024-07-15 10:02:01

You have the basic building blocks that bolt parallelism onto parts of a program. C++17 is getting many of them (e.g. parallel versions of foreach, sort, find and friends, map_reduce, map, reduce, prefix_sum, ...); see C++ Extensions for Parallelism.
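
A sketch of what those look like, assuming a toolchain that implements the C++17 execution policies in <execution> (libstdc++, for instance, needs TBB installed for these):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(10000000);
    std::iota(v.begin(), v.end(), 0.0);
    std::sort(std::execution::par, v.begin(), v.end());    // parallel sort
    double sum = std::reduce(std::execution::par,          // parallel reduce
                             v.begin(), v.end(), 0.0);
    (void)sum;
}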

Then you have items like continuations. Think std::future, but with continuations. There are a few ways of implementing these (Boost has a good one now, since the std one does not have a next(...) or then(...) method), but the big benefit is that one does not have to wait on a task before starting the next one:

// Boost.Thread spelling, since std::future has no then(...) yet:
auto fut = boost::async( []( ) { /* ...some work... */ } )
               .then( []( boost::future<void> prev ) { /* ...more work... */ } ); // .then(...) chains on
fut.wait( );

The lack of synchronisation between the subsequent tasks is important as communication between tasks/threads/... is what slows down parallel programs.

So task-based parallelism is really nice: with a task scheduler you just pass tasks off and walk away. They may have methods, such as a semaphore, to communicate back, but that is not mandatory. Both Intel Threading Building Blocks and the Microsoft Parallel Patterns Library have facilities for this.
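
For example, a minimal task-group sketch with Intel TBB (assuming TBB is installed; link with -ltbb):

#include <tbb/task_group.h>
#include <cstdio>

int main() {
    tbb::task_group g;
    g.run([] { std::printf("task 1\n"); });   // handed to the scheduler; we walk away
    g.run([] { std::printf("task 2\n"); });
    g.wait();                                 // the join point for the whole group
}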

After that we have the fork/join pattern. It does not mean creating N threads for each task, just that you have these N, ideally independent, things to do (fork), and when they are done there is a synchronisation point somewhere (join).

// make_semaphore and add_task are hypothetical helpers standing in for a scheduler's API
auto semaphore = make_semaphore( num_tasks );
add_task( [&semaphore]( ) { /* ...task1... */ semaphore.notify( ); } );
add_task( [&semaphore]( ) { /* ...task2... */ semaphore.notify( ); } );
...
add_task( [&semaphore]( ) { /* ...taskN... */ semaphore.notify( ); } );
semaphore.wait( );

From the above you can start to see that this is the pattern of a flow graph. A future chain is (A >> B >> C >> D) and fork/join is (A|B|C|D). With that you can combine them to form a graph: (A1>>A2|B1>>B2>>B3|C1|D1>>D2>>(E1>>E2|F1)), where A1>>A2 means A1 must precede A2, and A|B means that A and B can run concurrently. The slow parts are at the ends of the graphs/subgraphs, where things meet up.

The goal is to find independent parts of the system that do not need to communicate. The parallel algorithms, as noted above, are in almost all cases slower than their sequential counterparts until the workload gets high enough or the size gets big enough (assuming communication isn't too chatty). For example, sorting: on a 4-core computer you will get around 2.5x the performance, give or take, because the merge is chatty, requires a lot of synchronisation, and does not keep all the cores working after the first merge round. On a GPU with a very large N, one can use a less efficient sort, like bitonic sort, and it ends up being very fast because you have a lot of workers to power through the work, each quietly doing its own thing.

Some tricks to reduce communication include using an array for results, so that each task isn't trying to lock an object in order to push a value. Often the later reduction of these results can be very quick.
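
A sketch of that trick, with a placeholder computation standing in for the real per-task work: each task writes only its own slot, so the parallel phase needs no lock, and the reduction happens afterwards:

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> results(n, 0);     // one slot per task: no contention
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([i, &results] {
            // ...compute task i's contribution; a placeholder value here...
            results[i] = static_cast<long long>(i);
        });
    for (auto& t : pool) t.join();
    // the later reduction of the results array is quick and sequential
    long long total = std::accumulate(results.begin(), results.end(), 0LL);
    (void)total;
}

In practice one might pad or align the slots to avoid false sharing between adjacent results.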

But with all types of parallelism, the slowness comes from communication. Reduce it.
