Intel Parallel Studio timing inconsistencies
I have some code that uses Intel TBB and I'm running on a 32 core machine. In the code, I use
parallel_for(blocked_range (2,left_image_width-2, left_image_width /32) ...
to spawn 32 threads that do concurrent work. There are no race conditions, and each thread is hopefully given the same amount of work. I'm using clock_t to measure how long my program takes. For a certain image, it takes roughly 19 seconds to complete.
Then I ran my code through Intel Parallel Studio, and it ran in 2 seconds. That is the result I was expecting, but I can't figure out why there's such a large difference between the two. Does clock_t sum the clock cycles on all the cores? Even then it doesn't make sense. Below is the snippet in question.
clock_t begin=clock();
create_threads_and_do_work();
clock_t end=clock();
double diffticks=end-begin;
double diffms=(diffticks*1000)/CLOCKS_PER_SEC;
cout<<"And the time is "<<diffms<<" ms"<<endl;
Any advice would be appreciated.
Comments (1)
It isn't quite clear whether the difference in run time is the result of two different inputs (images) or simply of two different ways of measuring run time (the clock_t difference vs. the Intel software's measurement). Furthermore, you aren't showing us what goes on in create_threads_and_do_work(), and you didn't mention which tool within Intel Parallel Studio you are using. Is it VTune?
Your clock_t difference measures processor time, not elapsed time. The C standard defines clock() as the processor time used by the program, and on POSIX systems that time is accumulated across every thread in the process, so many busy worker threads can push the figure well above the wall-clock time a profiler reports (on Windows, clock() returns elapsed time instead). Whether the worker threads have finished by the second clock() call depends on whether create_threads_and_do_work() waits for all threads to complete before returning or simply spawns them and returns immediately. If all you do in the function is that parallel_for(), the work is complete when it returns, because parallel_for() blocks until every chunk has been processed.
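To see which clock is responsible for a gap like this, the same region can be timed with both clock() and a wall clock. The sketch below is one way to do that; the helper name time_region and the Timing struct are hypothetical and are not part of TBB or the original code. It relies on clock() reporting processor time used by the program (accumulated across all threads on POSIX systems) while std::chrono::steady_clock reports elapsed time.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <utility>

// Hypothetical result type: both measurements for one timed region.
struct Timing {
    double cpu_ms;   // from clock(): processor time charged to the program
    double wall_ms;  // from steady_clock: elapsed real time
};

// Hypothetical helper: runs `work` once and measures it with both clocks.
// `work` must not return until all of its threads have finished; otherwise
// neither figure covers the whole computation.
template <class F>
Timing time_region(F&& work) {
    const std::clock_t cpu_begin = std::clock();
    const auto wall_begin = std::chrono::steady_clock::now();

    std::forward<F>(work)();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_begin) / CLOCKS_PER_SEC;
    t.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_begin).count();
    assert(t.wall_ms >= 0.0);  // steady_clock never goes backwards
    return t;
}
```

Calling time_region([] { create_threads_and_do_work(); }) and printing both fields should show whether the 19 seconds are CPU time spread across the cores. If cpu_ms comes out many times larger than wall_ms, the work ran in parallel and wall_ms is the figure to compare against the profiler.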
Within Intel Parallel Studio there is a profiling tool called VTune. It is a powerful tool: when you run your program through it, you can view, graphically, the processing time (as well as the number of calls) of each function in your code. I'm pretty sure that after doing this you'll figure it out.
One last idea: did the program run to completion when using the Intel software? I'm asking because sometimes VTune will collect data for some time and then stop without allowing the program to complete.