任务怎么写？（并行代码）

发布于 2024-09-01 08:49:55 字数 545 浏览 13 评论 0原文

英特尔线程构建模块给我留下了深刻的印象。我喜欢如何编写任务而不是线程代码，并且我喜欢它在我有限的理解下如何在后台工作（任务在池中，4 核上不会有 100 个线程，不能保证任务运行，因为它不在它自己的线程，并且可能远离池，但它可能与另一个相关任务一起运行，因此您不能做坏事，例如典型的线程不安全代码）。

我想更多地了解写作任务。我喜欢这里的“基于任务的多线程 - 如何为 100 个核心进行编程”视频 http:// /www.gdcvault.com/sponsor.php?sponsor_id=1（当前是倒数第二个链接。警告它不是“很棒”）。我最喜欢的部分是“解决迷宫最好并行完成”，大约在 48 分钟左右（您可以点击左侧的链接。这部分确实是您需要观看的全部内容）。

不过，我喜欢看到更多代码示例和一些如何编写任务的 API。有人有好的资源吗？我不知道一个类或一段代码在将其推入池中后会是什么样子，也不知道当您需要复制所有内容以及有多少内容被推入池中时，代码可能会看起来多么奇怪。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

暮光沉寂 2024-09-08 08:49:55

Java 有一个类似于线程构建块的并行任务框架 - 它称为 Fork-Join 框架。它可用于当前的 Java SE 6 并包含在即将推出的 Java SE 7。

除了 javadoc 类文档之外，还有可用于开始使用该框架的资源。在 jsr166 页面中，提到
“还有一个 wiki，其中包含这些课程的附加文档、注释、建议、示例等。”

fork-join 示例，例如矩阵乘法，是一个很好的选择开始的地方。

我使用 fork-join 框架来解决一些 Intel 的问题2009 年线程挑战。该框架是轻量级且低开销的 - 我的框架是解决 Kight's Tour 问题的唯一 Java 条目，并且它的性能优于竞赛中的其他条目。 java 源代码和文章可从挑战网站下载。

编辑：

我不知道一个类或片段如何
的代码可能会在将其推送到
一个游泳池[...]

您可以通过子类化其中一个 ForKJoinTask 来创建自己的任务子类，例如 RecursiveTask。以下是如何并行计算斐波那契数列。（摘自 RecursiveTask javadocs - 注释是我的。）

 // declare a new task, that itself spawns subtasks. 
 // The task returns an Integer result.
 class Fibonacci extends RecursiveTask<Integer> {
   final int n;      // the n'th number in the fibonacci sequence to compute
   Fibonnaci(int n) { this.n = n; } // constructor
   Integer compute() {   // this method is the main work of the task
     if (n <= 1)         // 1 or 0, base case to end recursion
        return n;
     Fibonacci f1 = new Fibonacci(n - 1);  // create a new task to compute n-1
     f1.fork();                            // schedule to run asynchronously
     Fibonacci f2 = new Fibonacci(n - 2);  // create a new task to compute n-2
     return f2.invoke() + f1.join();       // wait for both tasks to compute.
       // f2 is run as part of this task, f1 runs asynchronously. 
       // (you could create two separate tasks and wait for them both, but running
       // f2 as part of this task is a little more efficient.
   }
 }

然后运行此任务并获得结果

// default parallelism is number of cores
ForkJoinPool pool = new ForkJoinPool(); 
Fibonacci f = new Fibonacci(100);
int result = pool.invoke(f);

这是一个简单的示例，旨在使事情变得简单。实际上，性能不会那么好，因为与任务框架的开销相比，任务执行的工作是微不足道的。根据经验，任务应该执行一些重要的计算 - 足以使框架开销变得微不足道，但又不至于在问题结束时只有一个核心运行一项大型任务。将大型任务拆分为较小的任务可确保一个核心不会在其他核心闲置时执行大量工作 - 使用较小的任务可以让更多核心忙碌，但又不会太小以致任务不执行任何实际工作。

[...] 或者当
你需要复印所有的东西
以及有多少东西是被推动的
到水池上。

仅任务本身被推入池中。理想情况下，您不想复制任何内容：为了避免干扰和需要锁定（这会减慢程序速度），您的任务理想情况下应该使用独立数据。只读数据可以在所有任务之间共享，并且不需要复制。如果线程需要合作构建一些大型数据结构，最好它们单独构建各个部分，然后在最后将它们组合起来。组合可以作为单独的任务来完成，或者每个任务都可以将其一部分添加到整体解决方案中。这通常确实需要某种形式的锁定，但如果任务的工作量远大于更新解决方案的工作量，那么这不是一个严重的性能问题。 My Knight's Tour 解决方案采用这种方法来更新板上的公共旅游存储库。

处理任务和并发是常规单线程编程的一个重大范式转变。通常有多种设计可以解决给定的问题，但只有其中一些适合线程解决方案。可能需要进行几次尝试才能了解如何以多线程方式重新构建熟悉的问题。最好的学习方法是查看示例，然后自己尝试。始终分析并测量改变线程数量的影响。您可以在池构造函数中显式设置要在池中使用的线程（核心）数量。当任务被线性分解时，随着线程数量的增加，您可以预期接近线性的加速。

Java has a parallel task framework similar to Thread Building Blocks - it's called the Fork-Join framework. It's available for use with the current Java SE 6 and to be included in the upcoming Java SE 7.

There are resources available for getting started with the framework, in addition to the javadoc class documentation. From the jsr166 page, mentions that
"There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."

The fork-join examples, such as matrix multiplication are a good place to start.

I used the fork-join framework in solving some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead - mine was the only Java entry for the Kight's Tour problem and it outperformed other entries in the competition. The java sources and writeup are available from the challenge site for download.

EDIT:

I have no idea how a class or pieces
of code may look after pushing it onto
a pool [...]

You can make your own task by subclassing one of the ForKJoinTask subclasses, such as RecursiveTask. Here's how to compute the fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs - comments are mine.)

 // declare a new task, that itself spawns subtasks. 
 // The task returns an Integer result.
 class Fibonacci extends RecursiveTask<Integer> {
   final int n;      // the n'th number in the fibonacci sequence to compute
   Fibonnaci(int n) { this.n = n; } // constructor
   Integer compute() {   // this method is the main work of the task
     if (n <= 1)         // 1 or 0, base case to end recursion
        return n;
     Fibonacci f1 = new Fibonacci(n - 1);  // create a new task to compute n-1
     f1.fork();                            // schedule to run asynchronously
     Fibonacci f2 = new Fibonacci(n - 2);  // create a new task to compute n-2
     return f2.invoke() + f1.join();       // wait for both tasks to compute.
       // f2 is run as part of this task, f1 runs asynchronously. 
       // (you could create two separate tasks and wait for them both, but running
       // f2 as part of this task is a little more efficient.
   }
 }

You then run this task and get the result

// default parallelism is number of cores
ForkJoinPool pool = new ForkJoinPool(); 
Fibonacci f = new Fibonacci(100);
int result = pool.invoke(f);

This is a trivial example to keep things simple. In practice, performance would not be so good, since the work executed by the task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that you end up with one core at the end of the problem running one large task. Splitting large tasks into smaller ones ensures that one core isn't left doing lots of work while other cores are idle - using smaller tasks keeps more cores busy, but not so small that the task does no real work.

[...] or how weird code may look when
you need to make a copy of everything
and how much of everything is pushed
onto a pool.

Only the tasks themselves are pushed into a pool. Ideally you don't want to be copying anything: to avoid interference and the need for locking, which would slow down your program, your tasks should ideally be working with independent data. Read-only data can be shared amongst all tasks, and doesn't need to be copied. If threads need to co-operate building some large data structure, it's best they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add it's piece of the puzzle to the overall solution. This often does require some form of locking, but it's not a considerable performance issue if the work of the task is much greater than the the work updating the solution. My Knight's Tour solution takes this approach to update a common repository of tours on the board.

Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several designs possible to solve a given problem, but only some of these will be suitable for a threaded solution. It can take a few attempts to get the feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples, and then try it for yourself. Always profile, and meausre the effects of varying the number of threads. You can explicitly set the number of threads (cores) to use in the pool in the pool constructor. When tasks are broken up linearly, you can expect near linear speedup as the number of threads increases.

回复收藏 0 原文

桃气十足 2024-09-08 08:49:55

玩那些声称可以解决无法解决的问题（最优任务调度是 NP 难）的“框架”根本不会对你有任何帮助——阅读有关并发算法的书籍和文章会对你有帮助。所谓的“任务”只不过是定义问题的可分离性（可以彼此独立计算的部分）的花哨名称。 可分离问题的类别非常小 - 并且它们已经在旧书中介绍过。

对于不可分离的问题，您必须规划阶段和阶段之间的数据屏障以交换数据。 > 用于同时数据交换的数据屏障的最佳编排不仅是 NP 困难，而且原则上不可能以通用方式解决 - 您需要检查所有可能的交错的历史 - 这就像已经指数集的幂集（就像去数学中从 N 到 R）。我提到的原因是为了明确指出，没有任何软件可以为您做到这一点，并且如何做到这一点本质上取决于实际算法以及并行化是否可行（即使理论上可行）的成败。

当您进入高并行度时，您甚至无法维护队列，甚至不再拥有内存总线 - 想象一下 100 个 CPU 试图在单个共享 int 上同步或尝试进行内存总线套利。您必须预先计划和预先配置将要运行的所有内容，并在白板上基本上证明其正确性。英特尔的线程构建模块在这个世界上还只是个小孩子。它们适用于仍然可以共享内存总线的少量核心。运行可分离的问题是轻而易举的事，无需任何“框架”即可完成。

因此，您又必须阅读尽可能多的不同并行算法。研究一个问题的近似最佳数据屏障布局通常需要 1-3 年的时间。当您在单个芯片上使用 16 个以上内核时，它就成为布局，因为只有第一个邻居可以有效地交换数据（在一个数据屏障周期期间）。因此，通过查看 CUDA 和论文以及 IBM 实验性 30 核 CPU 的结果，您实际上会学到比英特尔的销售宣传或某些 Java 玩具更多的知识。

请注意演示问题，这些问题所浪费的资源（核心数量和内存数量）远大于它们所实现的加速。如果需要 4 个内核和 4 倍 RAM 才能以 2 倍的速度解决某些问题，则该解决方案不可扩展以进行并行化。

Playing with "frameworks" which claim to be solving unsolvable (optimal task scheduling is NP hard) is not going to help you at all - reading books and than articles on concurrent algorithms will. So called "tasks" are nothing more that a fancy name for defining separability of the problem (parts that can be computed independently of each other). Class of separable problems is very small - and they are already covered in old books.

For problems which are not separable you have to plan phases and data barriers between phases to exchange data. Optimal orchestration of data barriers for simultaneous data exchange is not just NP hard but impossible to solve in a general way in principle - you'd need to examine history of all possible interleavings - that's like the power set of already exponential set (like going from N to R in math). The reason I mention is to make it clear that no software can ever do this for you and that how to do it is intrinsically dependent on actual algorithm and a make or break of whether parallelization is feasible at all (even if it's theoretically possible).

When you enter high parallelism you can't even maintain a queue, you don't even have a memory bus anymore - imagine 100 CPU-s trying to sync up on just a single shared int or trying to do memory bus arbitrage. You have to pre-plan and pre-configure everything that's going to run and essentially prove correctness on a white board. Intel's Threading Building Blocks are a small kid in that world. They are for small number of cores which can still share a memory bus. Running separable problems is a no-brainer which you can do without any "framework".

So you are back to having to read as many different parallel algorithms as you can. It normally takes 1-3 years to research approximately optimal data barrier layout for one problem. It becomes layout when you go for say 16+ cores on a single chip since only first neighbors can exchange data efficiently (during one data barrier cycle). So you'll actually learn much more by looking at CUDA and papers and results with IBM-s experimental 30-core CPU than Intel's sales pitch or some Java toy.

Beware of demo problems for which the size of resources wasted (number of cores and memory) is much bigger that the speedup they achieve. If it takes 4 cores and 4x RAM to solve something 2x faster, the solution is not scalable for parallelization.

回复收藏 0 原文

~没有更多了~