Passing the address of an iterator to the function in STL::for_each

Published 2024-12-10 20:09:18


I have a function that I eventually want to parallelize.

Currently, I call things in a for loop.

double temp = 0;
int y = 123;  // is a value set by other code
for(vector<double>::iterator i=data.begin(); i != data.end(); i++){
    temp += doStuff(i, y);
}

doStuff needs to know how far down the list it is. So I use i - data.begin() to calculate.

Next, I'd like to use the stl::for_each function instead. My challenge is that I need to pass the address of my iterator and the value of y. I've seen examples of using bind2nd to pass a parameter to the function, but how can I pass the address of the iterator as the first parameter?

The boost FOREACH macro also looks like a possibility; however, I do not know if it will parallelize auto-magically like the STL version does.

Thoughts, ideas, suggestions?

3 Answers

孤云独去闲 2024-12-17 20:09:18


If you want real parallelization here, use

  • GCC with tree vectorization optimization on (-O3) and SIMD (e.g. -march=native to get SSE support). If the operation (dostuff) is non-trivial, you could opt to do it ahead of time (std::transform or std::for_each) and accumulate next (std::accumulate) since the accumulation will be optimized like nothing else on SSE instructions!

    void apply_function(double& value)
    {
         value *= 3; // just a sample...
    }
    
    // ...
    
    std::vector<double> data(1000);
    std::for_each(data.begin(), data.end(), &apply_function);
    double sum = std::accumulate(data.begin(), data.end(), 0.0); // 0.0, not 0: an int seed would truncate
    

Note that though this will not actually run on multiple threads, the performance increase will be massive, since SSE4 instructions can handle many floating-point operations in parallel *on a single core*.

If you wanted true parallelism, use one of the following

GNU Parallel Mode

Compile with g++ -fopenmp -D_GLIBCXX_PARALLEL:

__gnu_parallel::accumulate(data.begin(), data.end(), 0.0);

OpenMP directly

Compile with g++ -fopenmp

double sum = 0.0;
#pragma omp parallel for reduction (+:sum)
for (size_t i = 0; i < data.size(); i++)
{
    sum += do_stuff(i, data[i]);
}

This will result in the loop being parallelized into as many threads (OMP team) as there are (logical) CPU cores on the actual machine, and the result 'magically' combined and synchronized.

Final remarks:

You can simulate the binary function for for_each by using a stateful function object. This is not exactly recommended practice, and it can look very inefficient (when compiling without optimization, it is), because function objects are passed by value throughout the STL. However, it is reasonable to expect a compiler to optimize that overhead away completely, especially for simple cases like the following:

struct myfunctor
{
    size_t index;
    myfunctor() : index(0) {}

    double operator()(const double& v)  // non-const: it updates index
    {
        ++index;       // track how far down the list we are
        return v * 3;  // again, just a sample
    }
};

// ...
std::for_each(data.begin(), data.end(), myfunctor());
你穿错了嫁妆 2024-12-17 20:09:18


temp += doStuff( i, y ); cannot be auto parallelized. The operator += doesn't play well with concurrency.

Further, the STL algorithms don't parallelize anything. Both Visual Studio and GCC have parallel algorithms similar to std::for_each; if that is what you're after, you'll have to use those.

OpenMP can auto parallelize for loops, but you have to use pragmas to tell the compiler when and how (it can't figure it out for you).

You may have confused parallelization with loop unrolling, which is a common optimization in std::for_each implementations.

乜一 2024-12-17 20:09:18


This is fairly straightforward if you can change doStuff so that it takes the value of the current element separately from the index at which the current element is located. Consider:

struct context {
    std::size_t _index;
    int         _y;
    double      _result;
};

context do_stuff_wrapper(context current, double value)
{
    current._result += doStuff(current._index, value, current._y);
    current._index++;
    return current;   // accumulate needs the updated state back
}

context c = { 0, 123, 0.0 };
context result = std::accumulate(data.begin(), data.end(), c, do_stuff_wrapper);

Note, however, that the Standard Library algorithms cannot "auto-parallelize" because the functions they call may have side effects (the compiler knows whether side effects are produced, but the library functions don't). If you want a parallelized loop, you'll have to go with a special-purpose parallelizing algorithms library, like PPL or TBB.
