How to test how code performance scales with problem size

Published 2025-01-26 02:54:39


I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.

    // Each thread walks its own set of slices (custom scheduling).
    for (const auto& index : slice_indices)
    {
        auto* tens1_data_stream = tens1.get_slice_data(index);
        const auto* tens2_data_stream = tens2.get_slice_data(index);
        const auto slice_size = tens1.get_slice_size(); // hoisted out of the inner loop
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < slice_size; ++d_index)
        {
            // Element-wise complex addition: tens1 += tens2
            tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
            tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
        }
    }

The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, L1 cache 32 KB, L2 cache 1 MB and L3 cache 33 MB. The total memory bandwidth is 115 GB/s.

The following is how my code scales with problem size S = N x N x N.
[figure: runtime vs. problem size S]

Can anybody tell me, given the information I've provided:

  1. whether it's scaling well, and/or
  2. how I could go about finding out if it's utilizing all the resources available to it?

Thanks in advance.

EDIT:

Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, same processor). It looks like this:
[figure: GFLOP/s vs. problem size, 24 and 48 cores]

And now the strong and weak scaling plots:
[figure: strong scaling]
[figure: weak scaling]

Note: I've measured the BW and it turns out to be 105 GB/s.

Question: The meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clear this up?


Comments (1)

三人与歌 2025-02-02 02:54:39


Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance; beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.

Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.

Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.

A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance until, as I indicated, you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for the cache effects I indicated.

So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200 GB/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
