A very basic for loop using TBB
I am a very new programmer, and I am having some trouble with the examples from Intel. I think it would be helpful if I could see how the most basic possible loop is implemented in tbb.
for (n = 0; n < songinfo.frames; ++n) {
    sli[n] = songin[n*2];
    sri[n] = songin[n*2+1];
}
Here is a loop I am using to de-interleave audio data. Would this loop benefit from tbb? How would you implement it?
Comments (1)
First of all, for the following code I assume your arrays are of type mytype*; otherwise the code needs some modifications. Furthermore, I assume that your ranges don't overlap; otherwise parallelization attempts won't work correctly (at least not without more work).

Since you asked for it in tbb:
First you need to initialize the library somewhere (typically in your main). For the code, assume I put a using namespace tbb somewhere.

Then you will need a functor which captures your arrays and executes the body of the for loop.

Now you can use parallel_for to parallelize this loop. That should do it (if I remember correctly; I haven't looked at tbb in a while, so there might be small mistakes).
If you use C++11, you can simplify the code by using a lambda.
That being said, tbb is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using openmp, which is quite a bit simpler to start with than tbb while still being powerful enough to parallelize a lot of stuff (it depends on the compiler supporting it, though). For your loop it only takes a #pragma omp parallel for directive placed directly in front of the for statement. Then you have to tell your compiler to compile and link with openmp (-fopenmp for gcc, /openmp for Visual C++). As you can see, it is quite a bit simpler to use than tbb (for such easy use cases; more complex scenarios are a different matter) and has the added benefit of also working on platforms which don't support openmp or tbb (since unknown #pragmas are ignored by the compiler). Personally, I'm using openmp in favor of tbb for some projects, since I couldn't use its open source license and buying tbb was a bit too steep for those projects.

Now that we have the how of parallelizing the loop out of the way, let's get to the question of whether it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth heavy, so I wouldn't count on too much of an increase in performance.
- [...] 1000 elements, the parallel version of the loop is very likely to be slower than the single-threaded version due to overhead.
- [...] 1.X even if you use a lot of processors).
- [...] __restrict (for gcc, no clue for vs) might help with that problem.

Personally, I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core cpu for which the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course, this is pure speculation.