I'd really appreciate it if someone with solid experience of Intel VTune Amplifier could explain this.
Recently I received a performance analysis report from some colleagues who ran Intel VTune Amplifier against my program. It says there is high Overhead Time in the thread concurrency area.
What does Overhead Time mean? They don't know (they asked me), and I don't have access to Intel VTune Amplifier myself.
I have a vague idea. The program has many thread-sleep calls, because pthread condition variables were unreliable on the target platform (or I used them badly), so I changed many routines to do their work in a polling loop like the one below:
while (true)
{
    mutex.lock();
    if (event_changed)      // pseudocode: has the event fired?
    {
        mutex.unlock();
        // do something
        break;
    }
    else
    {
        mutex.unlock();
        usleep(3 * 1000);   // back off for 3 ms before polling again
    }
}
Could this polling pattern be flagged as Overhead Time?
Any advice?
I found help documentation about Overhead Time from Intel site.
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time
Excerpt:
Overhead time is a duration that starts with the release of a shared resource and ends with the receipt of that resource. Ideally, the duration of Overhead time is very short because it reduces the time a thread has to wait to acquire a resource. However, not all CPU time in a parallel application may be spent on doing real pay load work. In cases when parallel runtime (Intel® Threading Building Blocks, OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting CPU time at high concurrency levels. For example, this may result from low granularity of work split in recursive parallel algorithms: when the workload size becomes too low, the overhead on splitting the work and performing the housekeeping work becomes significant.
Still confusing... Does it mean "you lock unnecessarily or too frequently"?
3 Answers
I am also not much of an expert on that, though I have tried to use pthread a bit myself.
To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:
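The answer does not include the code itself, so here is a minimal sketch of what such a single-threaded sum might look like (the function name `array_sum` and the workload are my own illustration, not the answerer's code):

```cpp
#include <vector>

// Single-threaded baseline: one thread walks the whole array
// and accumulates into a single local variable.
long long array_sum(const std::vector<int>& data)
{
    long long sum = 0;
    for (int x : data)
        sum += x;
    return sum;
}
```

Calling `array_sum(std::vector<int>(1000, 1))` returns 1000; there is no locking or thread coordination anywhere, so none of the time is overhead.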
In a simple [reasonably done] multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the sums are summed.
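A sketch of that reasonably done version using `std::thread` (the name `parallel_sum` and the chunking scheme are my assumptions; `num_threads` must be at least 1):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread sums its own slice into a private accumulator;
// the partial sums are only combined after all threads have joined,
// so no shared state is touched inside the hot loop.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads)
{
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;      // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;       // one write per thread, distinct slot, no lock
        });
    }
    for (std::thread& w : workers)
        w.join();

    long long total = 0;              // combine the per-thread sums once
    for (long long p : partial)
        total += p;
    return total;
}
```

Because the threads never contend for a shared resource, there is essentially nothing for a profiler to book as overhead here beyond thread creation and the joins.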
In a very poorly written multi-threaded version, the array could be broken down as before, and every thread could atomicAdd to a global sum. In this case, the atomic addition can only be done by one thread at a time. I believe that overhead time is a measure of how long all of the other threads spend waiting to do their own atomicAdd (you could try writing this program to check, if you want to be sure). Of course, it also takes into account the time it takes to deal with switching the semaphores and mutexes around. In your case, it probably means a significant amount of time is spent in the internals of mutex.lock and mutex.unlock.
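For contrast, a sketch of the poorly written variant the answer describes, where every element is funnelled through one shared atomic (`contended_sum` is my own name; the result is still correct, it is just slower under contention):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Every thread performs one contended read-modify-write on the same
// shared atomic per element, so the additions serialize. The time the
// threads spend waiting on each other here is the kind of time a
// profiler can report as threading overhead.
long long contended_sum(const std::vector<int>& data, unsigned num_threads)
{
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &total, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i]);  // one contended RMW per element
        });
    }
    for (std::thread& w : workers)
        w.join();
    return total.load();
}
```

Both versions compute the same sum; timing them against each other on a large array is one way to see the cost the answer is describing.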
I parallelized a piece of software a while ago (using pthread_barrier), and had issues where it took longer to run the barriers than it did to just use one thread. It turned out that the loop that had to have 4 barriers in it executed quickly enough that the overhead was not worth it.
Sorry, I'm not an expert on pthread or Intel VTune Amplifier, but yes, locking a mutex and unlocking it will probably count as overhead time. Locking and unlocking mutexes can be implemented as system calls, which the profiler would probably just lump under threading overhead.
I'm not familiar with VTune, but there is overhead in the OS for switching between threads. Each time one thread stops and another is loaded onto a processor, the current thread's context needs to be stored so it can be restored the next time that thread runs, and then the new thread's context needs to be restored so it can carry on processing.
The problem may be that you have too many threads, so the processor is spending most of its time switching between them. Multi-threaded applications run most efficiently when there are about as many threads as processors.
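One way to act on that advice is to size the worker pool from the hardware rather than hard-coding a thread count. A minimal sketch (the fallback value is my arbitrary assumption; note that `std::thread::hardware_concurrency` may legitimately return 0 when the count is unknown):

```cpp
#include <thread>

// Match the number of workers to the number of logical processors,
// falling back to a guess when the runtime cannot tell us.
unsigned pick_worker_count()
{
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 4;  // 4 is an arbitrary fallback, not a recommendation
}
```

With the worker count bounded by the processor count, the scheduler has far less reason to keep preempting your threads.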