A very basic for loop using TBB

Posted 2024-12-23 04:40:32


I am a very new programmer, and I am having some trouble with the examples from Intel. I think it would be helpful if I could see how the most basic possible loop is implemented in tbb.

for (n = 0; n < songinfo.frames; ++n) {
    sli[n] = songin[n*2];
    sri[n] = songin[n*2+1];
}

Here is a loop I am using to de-interleave audio data. Would this loop benefit from tbb? How would you implement it?


Comments (1)

唐婉 2024-12-30 04:40:32


First of all, for the following code I assume your arrays are of type mytype*; otherwise the code needs some modifications. Furthermore I assume that your ranges don't overlap, otherwise parallelization attempts won't work correctly (at least not without more work).

Since you asked for it in tbb:

First you need to initialize the library somewhere (typically in your main). For the code, assume I put a using namespace tbb somewhere.

#include <tbb/task_scheduler_init.h>

int main(int argc, char *argv[]){
   task_scheduler_init init; // starts the TBB task scheduler for the lifetime of init
   ...
}
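A hedged aside, in case you are on a recent oneTBB release (2021 or later) rather than classic TBB: there task_scheduler_init has been removed, the scheduler starts automatically on first use, and global_control is the usual way to cap the number of worker threads. A minimal sketch under that assumption (the thread count of 4 is arbitrary):

#include <tbb/global_control.h>

int main(int argc, char *argv[]){
   // oneTBB: no explicit scheduler init needed; this optional object caps
   // the number of worker threads for as long as it is alive.
   tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 4);
   // ... run your parallel_for calls as usual ...
   return 0;
}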

Then you will need a functor which captures your arrays and executes the body of the for loop:

#include <tbb/blocked_range.h>

struct apply_func {
    const mytype* songin; // whatever type you are operating on
    mytype* sli;
    mytype* sri;
    apply_func(const mytype* sin, mytype* sl, mytype* sr) : songin(sin), sli(sl), sri(sr)
    {}
    void operator()(const blocked_range<size_t>& range) const { // const: parallel_for passes the body by const reference
      for(size_t n = range.begin(); n != range.end(); ++n){
        sli[n] = songin[n*2];
        sri[n] = songin[n*2+1];
      }
    }
};

Now you can use parallel_for to parallelize this loop:

#include <tbb/parallel_for.h>

size_t grainsize = 1000; // or whatever you decide on (testing required for best performance)
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);

That should do it (if I remember correctly; I haven't looked at tbb in a while, so there might be small mistakes).
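As a side note, the grain size is optional: if I remember the defaults right, a blocked_range constructed without one uses a grain size of 1, and parallel_for's default partitioner (auto_partitioner) then chooses chunk sizes on its own, which is often a decent starting point before hand-tuning:

apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames), func); // let TBB pick chunk sizes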
If you use C++11, you can simplify the code by using a lambda:

size_t grainsize = 1000; // or whatever you decide on (testing required for best performance)
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize),
             [&](const blocked_range<size_t>& range){ // note: the range parameter needs a name
                for(size_t n = range.begin(); n != range.end(); ++n){
                  sli[n] = songin[n*2];
                  sri[n] = songin[n*2+1];
                }
             });
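For completeness, here is a minimal self-contained sketch of the lambda variant. The sample type (float), the frame count and the vectors are assumptions made purely for illustration, not part of the original question; compiling should only need the TBB headers and -ltbb:

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

int main() {
    const std::size_t frames = 1 << 20;           // made-up frame count
    std::vector<float> songin(frames * 2, 0.0f);  // interleaved L/R input
    std::vector<float> sli(frames), sri(frames);  // de-interleaved outputs

    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, frames),
        [&](const tbb::blocked_range<std::size_t>& range) {
            for (std::size_t n = range.begin(); n != range.end(); ++n) {
                sli[n] = songin[n * 2];      // left channel
                sri[n] = songin[n * 2 + 1];  // right channel
            }
        });
    return 0;
}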

That being said, tbb is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using openmp, which is quite a bit simpler to start with than tbb, while still being powerful enough to parallelize a lot of stuff (it depends on the compiler supporting it, though). For your loop it would look like the following:

#pragma omp parallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
  sli[n]=songin[n*2];
  sri[n]=songin[n*2+1];
}

Then you have to tell your compiler to compile and link with openmp (-fopenmp for gcc, /openmp for visual c++). As you can see it is quite a bit simpler to use than tbb (for such easy use cases; more complex scenarios are a different matter), and it has the added benefit of working on platforms which don't support openmp or tbb (since unknown #pragmas are ignored by the compiler). Personally I'm using openmp in favor of tbb for some projects, since I couldn't use tbb's open source license and buying tbb was a bit too steep for those projects.

Now that we have the how-to-parallelize-the-loop part out of the way, let's get to the question of whether it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth-heavy, so I wouldn't count on too much of an increase in performance.

  • If you are only processing 1000 elements, the parallel version of the loop is very likely to be slower than the single-threaded version due to overhead.
  • If your data is not in the cache (because it doesn't fit) and your system is very bandwidth-starved, you might not see much of a benefit (although it's likely that you will see some benefit, just don't be surprised if it's on the order of 1.x even if you use a lot of processors).
  • If your system is ccNUMA (likely for multi-socket systems), your performance might decrease regardless of the number of elements, due to additional transfer costs.
  • The compiler might miss optimizations regarding pointer aliasing (since the loop body is moved to a different function). Using __restrict (for gcc; no clue for vs) might help with that problem (see the sketch after this list).
  • ...
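Regarding the aliasing point, here is a hedged sketch of what an annotated loop body might look like. mytype is the same placeholder as above (aliased to float here purely for illustration), and __restrict is a compiler extension, so the exact spelling may vary on your compiler:

#include <cstddef>
using mytype = float; // placeholder; use whatever sample type you actually have

// Promise the compiler that the three buffers never alias, which can help it
// vectorize the copy loop. Purely illustrative; verify the keyword on your compiler.
void deinterleave(const mytype* __restrict songin,
                  mytype* __restrict sli,
                  mytype* __restrict sri,
                  std::size_t frames) {
    for (std::size_t n = 0; n < frames; ++n) {
        sli[n] = songin[n*2];
        sri[n] = songin[n*2+1];
    }
}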

Personally I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core cpu for which the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course this is pure speculation.
