Hyperthreading... made my renderer 10x slower

Posted on 2024-10-15 00:15:08

Executive summary:
How can one specify in code that OpenMP should only spawn threads for the REAL cores, i.e. not count the hyper-threading ones?

Detailed analysis: Over the years, I've coded a SW-only, open source renderer (rasterizer/raytracer) in my free time. The GPL code and Windows binaries are available from here:
https://www.thanassis.space/renderer.html
It compiles and runs fine under Windows, Linux, OS/X and the BSDs.

I introduced a raytracing mode this last month - and the quality of the generated pictures sky-rocketed. Unfortunately, raytracing is orders of magnitude slower than rasterizing. To increase speed, just as I did for the rasterizers, I added OpenMP (and TBB) support to the raytracer - to easily make use of additional CPU cores. Both rasterizing and raytracing are easily amenable to threading (work per triangle - work per pixel).

At home, with my Core2Duo, the 2nd core helped all the modes - both the rasterizing and the raytracing modes got a speedup that is between 1.85x and 1.9x.

The problem: Naturally, I was curious to see the top CPU performance (I also "play" with GPUs, with a preliminary CUDA port), so I wanted a solid base for comparisons. I gave the code to a good friend of mine, who has access to a "beast" machine, with a 16-core, $1500 Intel super processor.

He runs it in the "heaviest" mode, the raytracer mode...

...and he gets one fifth the speed of my Core2Duo (!)

Gasp - horror. What just happened?

We started trying different modifications, patches, ... and eventually we figured it out.

By using the OMP_NUM_THREADS environment variable, one can control how many OpenMP threads are spawned.
As the number of threads increased from 1 up to 8, the speed increased too (close to linearly).
The moment we crossed 8, speed started to diminish, until it nose-dived to one fifth the speed of my Core2Duo when all 16 "cores" were used!
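
The same sweep can also be driven from inside the program rather than through the environment variable. What follows is only a minimal sketch: `busy_work`, `time_run` and `sweep` are hypothetical names, and the loop is a stand-in workload, not the renderer's code. It is guarded with `_OPENMP` so it still compiles (single-threaded) when OpenMP is disabled.

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in workload; the renderer's real loop would go here. */
unsigned long long busy_work(unsigned long long n)
{
    unsigned long long k = 0;
    unsigned long long i;
    #pragma omp parallel for reduction(+:k)
    for (i = 0; i < n; i++)
        k += i % 10;
    return k;
}

/* Runs the workload with `threads` OpenMP threads; returns elapsed seconds.
   Without OpenMP the thread count is ignored and CPU time is measured. */
double time_run(int threads, unsigned long long n, unsigned long long *result)
{
#ifdef _OPENMP
    double t0;
    omp_set_num_threads(threads);
    t0 = omp_get_wtime();
    *result = busy_work(n);
    return omp_get_wtime() - t0;
#else
    clock_t t0 = clock();
    (void)threads;
    *result = busy_work(n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
#endif
}

/* Prints one timing line per thread count from 1 to max_threads. */
void sweep(int max_threads, unsigned long long n)
{
    int t;
    for (t = 1; t <= max_threads; t++) {
        unsigned long long r = 0;
        double secs = time_run(t, n, &r);
        printf("%2d threads: %.3f s (checksum %llu)\n", t, secs, r);
    }
}
```

Calling `sweep(16, 1000000000ULL)` from `main` in a build with OpenMP enabled (for example `-fopenmp` with GCC) should show where the timings stop improving on a given machine.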

Why 8?

Because 8 was the number of the real cores. The other 8 were... hyperthreading ones!

The theory: Now, this was news to me - I've seen hyper-threading help a lot (up to 25%) in other algorithms, so this was unexpected. Apparently, even though each hyper-threading core comes with its own registers (and SSE unit?), the raytracer could not make use of the extra processing power. Which led me to think...

It is probably not processing power that is starved - it is memory bandwidth.

The raytracer uses a bounding volume hierarchy data structure to accelerate ray-triangle intersections. If the hyperthreaded cores are used, then each of the "logical cores" in a pair is trying to read from different places in that data structure (i.e. in memory) - and the CPU caches (local per pair) are completely thrashed. At least, that's my theory - any suggestions most welcome.

So, the question: OpenMP detects the number of "cores" and spawns threads to match it - that is, it includes the hyperthreaded "cores" in the calculation. In my case, this apparently leads to disastrous results, speed-wise. Does anyone know how to use the OpenMP API (if possible, portably) to only spawn threads for the REAL cores, and not the hyperthreaded ones?

P.S. The code is open (GPL) and available at the link above, feel free to reproduce on your own machine - I am guessing this will happen in all hyperthreaded CPUs.

P.P.S. Excuse the length of the post, I thought it was an educational experience and wanted to share.


Comments (3)

仅此而已 2024-10-22 00:15:08

Basically, you need some fairly portable way of querying the environment for fairly low-level hardware details - and generally, you can't do that from just system calls (the OS is generally unaware even of the difference between hardware threads and cores).

One library which supports a number of platforms is hwloc - it supports Linux and Windows (and others), on both Intel and AMD chips. Hwloc will let you find everything out about the hardware topology, and knows the difference between cores and hardware threads (called PUs - processing units - in hwloc terminology). So you'd call this library at the start, find the number of actual cores, and call omp_set_num_threads() (or pass that number in a num_threads clause on your parallel regions).
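
A minimal sketch of that approach, under some assumptions: `HAVE_HWLOC` is a hypothetical build flag (define it when compiling against hwloc and linking with `-lhwloc`), and the three function names are made up for illustration. Without the flag, or if hwloc reports nothing usable, it falls back to whatever thread count OpenMP would have used anyway.

```c
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef HAVE_HWLOC   /* hypothetical build flag, not defined by hwloc itself */
#include <hwloc.h>
#endif

/* Returns the number of physical cores reported by hwloc, or -1 if unknown. */
int physical_core_count(void)
{
#ifdef HAVE_HWLOC
    hwloc_topology_t topo;
    int cores;
    if (hwloc_topology_init(&topo) != 0)
        return -1;
    if (hwloc_topology_load(topo) != 0) {
        hwloc_topology_destroy(topo);
        return -1;
    }
    /* HWLOC_OBJ_CORE counts cores; HWLOC_OBJ_PU would count hardware threads. */
    cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topo);
    return cores > 0 ? cores : -1;
#else
    return -1;
#endif
}

/* Uses the detected core count if it is known, otherwise the fallback. */
int choose_thread_count(int cores, int fallback)
{
    return cores > 0 ? cores : fallback;
}

/* Call once at startup, before the first parallel region.
   Returns the thread count that was selected. */
int limit_openmp_to_physical_cores(void)
{
    int fallback = 1;
    int n;
#ifdef _OPENMP
    fallback = omp_get_max_threads();
#endif
    n = choose_thread_count(physical_core_count(), fallback);
#ifdef _OPENMP
    omp_set_num_threads(n);
#endif
    return n;
}
```

Note that this only limits how many threads are spawned; it does not control which logical processors the OS schedules them on.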

亢潮 2024-10-22 00:15:08

Unfortunately, your assumption about why this is occurring is most likely correct. To be sure, you would have to use a profiling tool - but I have seen this before with raytracing, so it is not surprising. In any case, there is currently no way to determine from OpenMP that some of the processors are "real" and some are hyperthreaded. You could write some code to determine this and then set the number yourself. However, there would still be the problem that OpenMP doesn't schedule the threads on the processors itself - it allows the OS to do that.

There has been work in the OpenMP ARB language committee to try and define a standard way for users to determine their environment and say how to run. At this time, this discussion is still raging on. Many implementations allow you to "bind" the threads to the processors, by use of an implementation-defined environment variable. However, the user has to know the processor numbering and which processors are "real" vs. hyperthreaded.
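
As one possible reading of "write some code to determine this", here is a Linux-only sketch that is not portable to the other platforms mentioned in the question. It assumes the kernel exposes `thread_siblings_list` under sysfs and counts each group of sibling hardware threads once; the function names and the scan limit are made up for illustration. It does nothing about binding threads to processors.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parses the first CPU number from a sibling list such as "0,8" or "2-3".
   Returns -1 if the string does not start with a number. */
int first_cpu_in_list(const char *list)
{
    char *end;
    long v = strtol(list, &end, 10);
    return end == list ? -1 : (int)v;
}

/* Counts physical cores on Linux: a logical CPU is counted only when it is
   the first entry of its own sibling list, so each core is counted once.
   Returns -1 if the sysfs topology files are unavailable. */
int count_physical_cores_linux(void)
{
    enum { MAX_CPUS = 1024 };   /* arbitrary scan limit (assumption) */
    int cpu, cores = 0;
    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
        char path[128], line[256];
        FILE *f;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (!f)
            continue;           /* CPU absent, offline, or not Linux */
        if (fgets(line, sizeof line, f) && first_cpu_in_list(line) == cpu)
            cores++;
        fclose(f);
    }
    return cores > 0 ? cores : -1;
}
```

The result, when positive, could be passed to omp_set_num_threads() at startup.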

浊酒尽余欢 2024-10-22 00:15:08

The problem is how OMP uses HT.
It's not memory bandwidth!
I tried a simple loop on my 2.6GHz HT PIV (Pentium 4).
The result is amazing...

With OMP:

    $ time ./a.out 
    4500000000
    real    0m28.360s
    user    0m52.727s
    sys 0m0.064s

Without OMP:

    $ time ./a.out
    4500000000
    real    0m25.417s
    user    0m25.398s
    sys 0m0.000s

Code:

    #include <stdio.h>
    #define U64 unsigned long long
    int main() {
      U64 i;
      U64 N = 1000000000ULL; 
      U64 k = 0;
      #pragma omp parallel for reduction(+:k)
      for (i = 0; i < N; i++) 
      {
        k += i%10; // last digit
      }
      printf ("%llu\n", k);
      return 0;
    }