How do I find opportunities for parallelization?
I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized for loops nested inside other for loops), and having done that I now get some speedup. I am looking for more places, ideas, or options to parallelize. I know this may sound vague without much detail about the problem, but I am looking for generic ideas that I can explore in my code.
Overview of the algorithm (it is run over all levels of the image, starting with the smallest and increasing the width and height by 2 each time until the actual height and width are reached):
For all image pairs starting with the smallest pair
    For height = 2 to image_height - 2
        Create a 5 by image_width ROI of both left and right images.
        For width = 2 to image_width - 2
            Create a 5 by 5 window of the left ROI centered around width and find best match in the right ROI using NCC
            Create a 5 by 5 window of the right ROI centered around width and find best match in the left ROI using NCC
            Disparity = current_width - best match
    The edge pixels that did not receive a disparity get the disparity of their neighbors
    For height = 0 to image_height
        For width = 0 to image_width
            Check smoothness, uniqueness and order constraints* (parallelized separately)
    For height = 0 to image_height
        For width = 0 to image_width
            For disparity that failed constraints, use the average disparity of neighbors that passed the constraints
    Normalize all disparity and output to screen
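In the overview above, each row of the matching step appears to read only the two input images and write only its own disparities, which makes the outer height loop a natural unit to split across threads. The following is a minimal sketch of that idea, assuming standard C++ threads instead of TBB so that it builds without extra libraries. The names `parallel_rows` and `compute_disparity` are hypothetical, and the per-pixel arithmetic is only a placeholder for the NCC search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: calls body(row) for every row in [first, last),
// splitting the rows into contiguous chunks, one chunk per thread.
void parallel_rows(std::size_t first, std::size_t last,
                   const std::function<void(std::size_t)>& body,
                   unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(first <= last);
    if (first == last) return;
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0
    const std::size_t rows = last - first;
    const std::size_t workers = std::min<std::size_t>(num_threads, rows);
    const std::size_t chunk = (rows + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (std::size_t begin = first; begin < last; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, last);
        pool.emplace_back([begin, end, &body] {
            for (std::size_t r = begin; r < end; ++r) body(r);
        });
    }
    for (auto& t : pool) t.join();
}

// Placeholder for the matching step: every row writes only its own
// output elements, so rows can be processed in any order or concurrently.
std::vector<int> compute_disparity(std::size_t height, std::size_t width) {
    std::vector<int> disparity(height * width, 0);
    if (height < 4 || width < 4) return disparity;  // no interior pixels
    parallel_rows(2, height - 2, [&](std::size_t y) {
        for (std::size_t x = 2; x < width - 2; ++x) {
            // Stand-in value; the real code would run the NCC search here.
            disparity[y * width + x] = static_cast<int>((x * 7 + y * 3) % 16);
        }
    });
    return disparity;
}
```

The later steps (constraint checks, averaging over neighbors) read values produced by other rows, so they would have to wait until every thread of the matching step has joined before they start.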
Comments (2)
Just for some perspective: it may not always be worthwhile to parallelize something.
Just because you have a for loop where each iteration can be done independently of the others doesn't always mean you should parallelize it.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But if each iteration is extremely expensive (like in CirrusFlyer's example), then feel free to parallelize it.
More specifically, look for cases where the overhead of parallelizing is small relative to the cost of the work being parallelized.
Also, be careful about nesting parallel_for loops, as this can get expensive. You may want to stick with parallelizing only the outer for loop.
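The two cautions above (leave small loops serial, and parallelize only the outer loop) can be sketched roughly as follows. This uses standard C++ threads rather than TBB so that it builds without extra libraries. `for_each_index`, `scale_image`, and the cutoff `kMinParallelIterations` are hypothetical names, and the cutoff value is an assumption that would have to be tuned by measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff: below this many iterations, the cost of starting
// threads is assumed to outweigh any gain. The value is illustrative only.
constexpr std::size_t kMinParallelIterations = 1024;

// Runs body(i) for i in [0, n). Short loops run serially; long loops are
// split into one contiguous chunk per hardware thread.
// Returns the number of threads spawned (0 means the loop ran serially).
template <typename Body>
std::size_t for_each_index(std::size_t n, Body body) {
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0
    if (n < kMinParallelIterations || hw <= 1) {
        for (std::size_t i = 0; i < n; ++i) body(i);
        return 0;
    }
    const std::size_t chunk = (n + hw - 1) / hw;  // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        pool.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();
    return pool.size();
}

// Only the outer (row) loop goes through for_each_index; the inner
// (column) loop stays an ordinary serial loop inside each task.
std::vector<long> scale_image(const std::vector<long>& in, std::size_t rows,
                              std::size_t cols, long factor) {
    assert(in.size() == rows * cols);
    std::vector<long> out(in.size());
    for_each_index(rows, [&](std::size_t r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[r * cols + c] = in[r * cols + c] * factor;
        }
    });
    return out;
}
```

In TBB itself, the grain size argument of `tbb::blocked_range` plays a comparable role: it sets how small a piece of the range may become before TBB stops splitting it further.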
The silly answer is anything that is time-consuming or iterative. I use Microsoft's .NET v4.0 Task Parallel Library (TPL), and one of the interesting things about its setup is its "expressed parallelism," an interesting term for "attempted parallelism." Although your code may say "use the TPL here," if the host platform doesn't have the necessary cores it will simply run the old-fashioned serial code in its place.
I have begun to use the TPL on all my projects, especially any place there are loops (this requires that I design my classes and methods so that there are no dependencies between the loop iterations). And any place that might once have been just good old-fashioned multithreaded code, I look to see whether it's something I can now place on different cores.
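As an illustration of what "no dependencies between the loop iterations" can mean in practice: a running total that every iteration updates is a dependency, and one common way to remove it is to give each worker its own partial result and combine the partials at the end. The sketch below is written in standard C++ rather than the TPL, and `parallel_sum` and its chunking scheme are hypothetical, chosen only to show the restructuring.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `values` using `num_threads` workers. Each worker accumulates into
// its own slot of `partial`, so no iteration reads or writes state that
// another worker touches.
long long parallel_sum(const std::vector<int>& values, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = values.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling

    std::vector<long long> partial(num_threads, 0);  // one slot per worker
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) continue;  // nothing left for this worker
        pool.emplace_back([&values, &partial, t, begin, end] {
            const auto first = values.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& th : pool) th.join();

    // The only step that combines results runs serially, after all workers finish.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The same restructuring applies when the per-iteration result is something other than a sum, as long as the partial results can be merged afterwards in any order.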
My favorite so far has been an application that downloads ~7,800 different URLs to analyze the contents of the pages and, if it finds the information it's looking for, does some additional processing. This used to take between 26 and 29 minutes to complete. My Dell T7500 workstation, with dual quad-core 3GHz Xeon processors, 24GB of RAM, and Windows 7 Ultimate 64-bit, now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish/subscribe communication engine that I have been refactoring to take advantage of the TPL, especially when pushing data from the server to clients: you may have 10,000 client computers that have registered their interest in specific things, and once that event occurs I need to push data to all of them. I don't have this done yet, but I'm really looking forward to seeing the results on this one.
Food for thought ...