Passing the address of an iterator to a function in STL::for_each
I have a function that I eventually want to parallelize.
Currently, I call things in a for loop.
double temp = 0;
int y = 123; // value set by other code
for(vector<double>::iterator i=data.begin(); i != data.end(); i++){
temp += doStuff(i, y);
}
doStuff needs to know how far down the list it is, so I use i - data.begin() to calculate that.
Next, I'd like to use std::for_each instead. My challenge is that I need to pass both the address of my iterator and the value of y. I've seen examples that use bind2nd to pass a parameter to the function, but how can I pass the address of the iterator as the first parameter?
The Boost FOREACH macro also looks like a possibility; however, I do not know whether it will parallelize auto-magically the way the STL version does.
Thoughts, ideas, suggestions?
3 Answers
If you want real parallelization here, use GCC with tree-vectorization optimization on (-O3) and SIMD (e.g. -march=native to get SSE support). If the operation (doStuff) is non-trivial, you could opt to do it ahead of time (std::transform or std::for_each) and accumulate next (std::accumulate), since the accumulation will be optimized like nothing else on SSE instructions! Note that though this will not actually run on multiple threads, the performance increase will be massive, since SSE4 instructions can handle many floating-point operations in parallel on a single core.
If you wanted true parallelism, use one of the following:
- GNU parallel mode: compile with g++ -fopenmp -D_GLIBCXX_PARALLEL
- OpenMP directly: compile with g++ -fopenmp
This will result in the loop being parallelized into as many threads (an OMP team) as there are (logical) CPU cores on the actual machine, and the result 'magically' combined and synchronized.
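With GNU parallel mode, the standard algorithm calls themselves are dispatched to parallel implementations at compile time, so no source change is needed. Using OpenMP directly means marking the loop yourself; a sketch keeping the question's doStuff(iterator, y) signature (the wrapper function is hypothetical):

#include <vector>

double doStuff(std::vector<double>::iterator it, int y);  // as in the question

double sumOfDoStuff(std::vector<double>& data, int y)
{
    double temp = 0.0;
    // The reduction clause gives every thread in the OMP team its own private
    // copy of temp and combines the copies when the loop finishes.
    #pragma omp parallel for reduction(+:temp)
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        temp += doStuff(data.begin() + i, y);
    return temp;
}
// Build with: g++ -fopenmp file.cpp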
Final remarks:
You can simulate the binary function for for_each by using a stateful function object. This is not exactly recommended practice. It will also appear to be very inefficient (when compiling without optimization, it is), because function objects are passed by value throughout the STL. However, it is reasonable to expect a compiler to completely optimize that potential overhead away, especially for simple cases like the following:
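The code example the original answer refers to is not preserved here; the following is a sketch of such a stateful functor (the struct name Accumulator and its layout are illustrative; the doStuff call matches the question's loop):

#include <algorithm>  // std::for_each
#include <cstddef>
#include <vector>

double doStuff(std::vector<double>::iterator it, int y);  // as in the question

// Stateful functor: carries the begin iterator, a running index, y, and the
// sum, so the single-argument call operator that for_each expects suffices.
struct Accumulator
{
    std::vector<double>::iterator begin;
    std::size_t index;
    int y;
    double sum;

    Accumulator(std::vector<double>::iterator b, int y_)
        : begin(b), index(0), y(y_), sum(0.0) {}

    void operator()(double& /*element*/)
    {
        sum += doStuff(begin + index, y);  // same call as in the original loop
        ++index;
    }
};

// for_each copies the functor and returns that copy, so read the sum off the
// returned object:
//   double temp = std::for_each(data.begin(), data.end(),
//                               Accumulator(data.begin(), y)).sum;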
The expression temp += doStuff(i, y); cannot be auto-parallelized: operator += doesn't play well with concurrency.
Further, the STL algorithms don't parallelize anything. Both Visual Studio and GCC have parallel algorithms similar to std::for_each; if that is what you're after, you'll have to use those. OpenMP can auto-parallelize for loops, but you have to use pragmas to tell the compiler when and how (it can't figure that out for you).
You may have confused parallelization with loop unrolling, which is a common optimization in std::for_each implementations.
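On the GCC side, libstdc++'s parallel mode can also be invoked explicitly rather than globally. A sketch under the assumption that the per-element results have already been computed into a separate vector (called partial here), since a shared temp += inside the loop body would be exactly the concurrency problem described above:

#include <parallel/numeric>  // __gnu_parallel::accumulate (libstdc++ parallel mode)
#include <vector>

double sumPartials(const std::vector<double>& partial)
{
    // Explicitly parallel reduction; build with: g++ -fopenmp file.cpp
    return __gnu_parallel::accumulate(partial.begin(), partial.end(), 0.0);
}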
This is fairly straightforward if you can change doStuff so that it takes the value of the current element separately from the index at which the current element is located (see the sketch at the end of this answer). Note, however, that the Standard Library algorithms cannot "auto-parallelize", because the functions they call may have side effects (the compiler knows whether side effects are produced, but the library functions don't). If you want a parallelized loop, you'll have to go with a special-purpose parallelizing algorithms library, like PPL or TBB.
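The concrete example from this answer did not survive in this copy; a sketch of the kind of change it describes, with doStuff reworked (hypothetically) to take the element's value and its index separately:

#include <cstddef>
#include <vector>

// Hypothetical reworked signature: current element's value, its index, and y.
double doStuff(double value, std::size_t index, int y);

double sumOfDoStuff(const std::vector<double>& data, int y)
{
    double temp = 0.0;
    // Each call now depends only on its own element and index, which is the
    // shape a parallel algorithm (PPL, TBB) can work with.
    for (std::size_t i = 0; i != data.size(); ++i)
        temp += doStuff(data[i], i, y);
    return temp;
}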