线程构建块中嵌套循环的并行化

发布于 2024-10-29 12:52:08 字数 1485 浏览 9 评论 0原文

我是线程构建块的新手,并尝试在 TBB 中编码 FFT 算法以获得一些实践经验。在这种算法的情况下,我只能并行化最内层的循环。但这样做性能已经下降到不可接受的程度(超过一千次)。我尝试过将数组大小设置为最大 2^20 但结果相同。我的代码如下所示

for(stage1=0;stage1 < lim;stage1 ++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2; 
    k = -1*2*PI/butterfly_index2;
    k_next = 0.0;
    for(stage2 = 0 ; stage2 < butterfly_index1;stage2++)
    {
        sine=sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT. sine = &sine;
        FFT.cosine = &cosine;
        FFT.n2 = &butterfly_index2;
        FFT.loop_init = &stage2;
        FFT.n1 = &butterfly_index1;
        parallel_for(blocked_range<int>(
                        stage2,SIZE,SIZE/4),FFT,simple_partitioner()); 
    }
}   

,parallel_loop 的主体是

void operator()(const blocked_range<int> &r)const
{
    for(int k = r.begin(); k != r.end(); k++)
    {   
        if(k != *loop_init)
        {
            if((k - (*loop_init))% (* n2)!= 0)
                continue;
        }           
        temp_real = (*cosine) * X[k + *n1].real - (*sine) * X[k + *n1].img;
        temp_img = (*sine)* X[k + *n1].real + (*cosine) * X[k + *n1].img;
        X[k + *n1].real = X[k].real - temp_real;
        X[k + *n1].img = X[k].img - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img = X[k].img + temp_img; 
    }
}

如果我用正常循环替换它,那么事情是正确的。

I am new to threading building blocks and trying to encode the FFT algorithm in TBB to get some hands on experience. In case of this algorithm i can only parallelize the innermost loop. but by doing so the performance has degraded up to unacceptable degree (more than thousand times). I have tried for array size up to 2^20 but same result. My code is given below

for(stage1=0;stage1 < lim;stage1 ++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2; 
    k = -1*2*PI/butterfly_index2;
    k_next = 0.0;
    for(stage2 = 0 ; stage2 < butterfly_index1;stage2++)
    {
        sine=sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT. sine = &sine;
        FFT.cosine = &cosine;
        FFT.n2 = &butterfly_index2;
        FFT.loop_init = &stage2;
        FFT.n1 = &butterfly_index1;
        parallel_for(blocked_range<int>(
                        stage2,SIZE,SIZE/4),FFT,simple_partitioner()); 
    }
}   

and body for parallel_loop is

void operator()(const blocked_range<int> &r)const
{
    for(int k = r.begin(); k != r.end(); k++)
    {   
        if(k != *loop_init)
        {
            if((k - (*loop_init))% (* n2)!= 0)
                continue;
        }           
        temp_real = (*cosine) * X[k + *n1].real - (*sine) * X[k + *n1].img;
        temp_img = (*sine)* X[k + *n1].real + (*cosine) * X[k + *n1].img;
        X[k + *n1].real = X[k].real - temp_real;
        X[k + *n1].img = X[k].img - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img = X[k].img + temp_img; 
    }
}

If i replace it with normal loop then things are correct.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

征﹌骨岁月お 2024-11-05 12:52:08

对于非常短的工作负载,由于线程创建的开销,可能会出现巨大的速度减慢。对于数组中的 2^20 个元素,我不相信会发生如此巨大的性能下降。

性能下降的另一个重要原因是编译器在 TBB 化后无法优化(特别是向量化)代码。了解您的编译器是否可以生成矢量化报告,并查找串行版本和 TBB 版本之间的差异。

速度减慢的一个可能原因可能是parallel_for 的主体类的复制构造函数,因为主体被复制了很多。但给定的代码在这方面看起来并不可疑:似乎主体包含一些指针。不管怎样,看看是否有问题。


显着开销的另一个常见来源是太细粒度的并行性,即许多任务,每个任务只包含很少的工作。但这里情况也不是这样,因为blocked_range的第三个参数中的grainsize SIZE/4指定每个算法最多调用body的operator()() 4次。

我建议不要在初始实验中指定 simple_partitioner 和grainsize,而是让 TBB 动态分配工作。如果需要,您可以稍后进行调整。

For very short workloads, the huge slowdown can happen due to the overhead on thread creation. For 2^20 elements in the array, I do not believe such a huge performance drop happens.

Another significant source of performance degradation is compiler's inability to optimize (in particular, vectorize) the code after it has been TBBfied. Find out whether your compiler may produce a vectorization report, and look for differences between serial and TBB versions.

A possible source of slowdown might be the copy constructor of parallel_for's body class, because the bodies are copied a lot. But the given code does not look suspicious in this regard: seems the body contains a few pointers. Anyway, look whether it could be a problem.


Another usual source of significant overhead is too fine-grained parallelism, i.e. a lot of tasks each containing very little work. But here this is not the case either, because grainsize SIZE/4 in the 3rd parameter to blocked_range specifies that body's operator()() will be invoked at most 4 times per the algorithm.

I would recommend to not specify simple_partitioner and grainsize for initial experiments, and instead let TBB distribute the work dynamically. You can tune it later if necessary.

鯉魚旗 2024-11-05 12:52:08

我的代码的问题是工作量随着 n2 变量的增加而减少。因此,随着外循环的进行,parallel_for 的工作负载会减半,并且在几次迭代之后,它会变得太小,无法通过使用 TBB 来提高性能。因此,解决方案是仅当内部循环的工作负载足够时才并行化内部循环的迭代,并串行化其余迭代的内部循环。其次,包含条件检查 (k != r.end()) 的“for”循环头也会降低性能。解决方案是将 r.end() 替换为本地定义的变量,该变量初始化为 r.end()

感谢英特尔软件论坛为解决此问题提供的帮助

the problem with my code was the decrease in workload with increase in n2 variable. so as the out loop proceeds the workload for parallel_for become half and after just a few iteration it becomes too small to give performance boost by using TBB. So solution is to parallelize the inner loop only for iterations when the workload for inner loop is sufficient ans serialize the inner loop for rest of iterations. secondly the "for" loop header that contains the condition check (k != r.end()) also degrades the performance. solution is to replace the r.end() with a locally defined variable that is initialized to r.end()

Thanks to intel software forum for their assistance in solving this problem

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文