I'd really appreciate it if someone with solid experience of Intel VTune Amplifier could explain this.
Recently I received a performance analysis report from some colleagues who ran Intel VTune Amplifier against my program. It says there is high Overhead Time in the thread concurrency area.
What does Overhead Time mean? They don't know (they asked me), and I don't have access to Intel VTune Amplifier myself.
I have a vague idea. The program has many thread-sleep calls, because pthread condition variables were unreliable on the target platform (or I used them badly), so I changed many routines to do their work in a polling loop like the one below:
while (true)
{
    mutex.lock();
    if (event_changed)      // pseudocode: has the event fired?
    {
        mutex.unlock();
        // do something
        break;
    }
    else
    {
        mutex.unlock();
        usleep(3 * 1000);   // back off for 3 ms before polling again
    }
}
Could this polling pattern be flagged as Overhead Time?
Any advice?
I found help documentation about Overhead Time from Intel site.
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time
Excerpt:
Overhead time is a duration that starts with the release of a shared resource and ends with the receipt of that resource. Ideally, the duration of Overhead time is very short because it reduces the time a thread has to wait to acquire a resource. However, not all CPU time in a parallel application may be spent on doing real pay load work. In cases when parallel runtime (Intel® Threading Building Blocks, OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting CPU time at high concurrency levels. For example, this may result from low granularity of work split in recursive parallel algorithms: when the workload size becomes too low, the overhead on splitting the work and performing the housekeeping work becomes significant.
Still confusing... Does it mean "you lock unnecessarily or too frequently"?
3 Answers
I am also not much of an expert on that, though I have tried to use pthread a bit myself.
To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:
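The answer does not include the code itself, so here is a minimal sketch of what such a single-threaded sum might look like (the function name `array_sum` and the workload are my own illustration, not the answerer's code):

```cpp
#include <vector>

// Single-threaded baseline: one thread walks the whole array
// and accumulates into a single local variable.
long long array_sum(const std::vector<int>& data)
{
    long long sum = 0;
    for (int x : data)
        sum += x;
    return sum;
}
```

Calling `array_sum(std::vector<int>(1000, 1))` returns 1000; there is no locking or thread coordination anywhere, so none of the time is overhead.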
In a simple [reasonably done] multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the sums are summed.
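A sketch of that reasonably done version using `std::thread` (the name `parallel_sum` and the chunking scheme are my assumptions; `num_threads` must be at least 1):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread sums its own slice into a private accumulator;
// the partial sums are only combined after all threads have joined,
// so no shared state is touched inside the hot loop.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads)
{
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;      // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;       // one write per thread, distinct slot, no lock
        });
    }
    for (std::thread& w : workers)
        w.join();

    long long total = 0;              // combine the per-thread sums once
    for (long long p : partial)
        total += p;
    return total;
}
```

Because the threads never contend for a shared resource, there is essentially nothing for a profiler to book as overhead here beyond thread creation and the joins.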
In a very poorly written multi-threaded version, the array could be broken down as before, and every thread could atomicAdd to a global sum. In this case, the atomic addition can only be done by one thread at a time. I believe that overhead time is a measure of how long all of the other threads spend waiting to do their own atomicAdd (you could try writing this program to check, if you want to be sure). Of course, it also takes into account the time it takes to deal with switching the semaphores and mutexes around. In your case, it probably means a significant amount of time is spent in the internals of mutex.lock and mutex.unlock.
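For contrast, a sketch of the poorly written variant the answer describes, where every element is funnelled through one shared atomic (`contended_sum` is my own name; the result is still correct, it is just slower under contention):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Every thread performs one contended read-modify-write on the same
// shared atomic per element, so the additions serialize. The time the
// threads spend waiting on each other here is the kind of time a
// profiler can report as threading overhead.
long long contended_sum(const std::vector<int>& data, unsigned num_threads)
{
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &total, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i]);  // one contended RMW per element
        });
    }
    for (std::thread& w : workers)
        w.join();
    return total.load();
}
```

Both versions compute the same sum; timing them against each other on a large array is one way to see the cost the answer is describing.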
I parallelized a piece of software a while ago (using pthread_barrier), and had issues where it took longer to run the barriers than it did to just use one thread. It turned out that the loop that had to have 4 barriers in it executed quickly enough that the overhead was not worth it.
Sorry, I'm not an expert on pthread or Intel VTune Amplifier, but yes, locking a mutex and unlocking it will probably count as overhead time. Locking and unlocking mutexes can be implemented as system calls, which the profiler would probably just lump under threading overhead.
I'm not familiar with VTune, but there is overhead in the OS for switching between threads. Each time one thread stops and another is loaded onto a processor, the current thread's context needs to be stored so it can be restored the next time that thread runs, and then the new thread's context needs to be restored so it can carry on processing.
The problem may be that you have too many threads, so the processor is spending most of its time switching between them. Multi-threaded applications run most efficiently when there are about as many threads as processors.
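One way to act on that advice is to size the worker pool from the hardware rather than hard-coding a thread count. A minimal sketch (the fallback value is my arbitrary assumption; note that `std::thread::hardware_concurrency` may legitimately return 0 when the count is unknown):

```cpp
#include <thread>

// Match the number of workers to the number of logical processors,
// falling back to a guess when the runtime cannot tell us.
unsigned pick_worker_count()
{
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 4;  // 4 is an arbitrary fallback, not a recommendation
}
```

With the worker count bounded by the processor count, the scheduler has far less reason to keep preempting your threads.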