How to test how code performance scales with problem size

Published 2025-01-26 02:54:39


I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.

    // Each thread walks its own set of slices (custom scheduling).
    for (const auto& index : slice_indices)
    {
        auto* tens1_data_stream = tens1.get_slice_data(index);
        const auto* tens2_data_stream = tens2.get_slice_data(index);
        const auto slice_size = tens1.get_slice_size(); // hoisted out of the inner loop
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < slice_size; ++d_index)
        {
            // Element-wise complex addition: tens1 += tens2
            tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
            tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
        }
    }

The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, L1 cache 32 KB, L2 cache 1 MB and L3 cache 33 MB. The total memory bandwidth is 115 GB/s.

The following is how my code scales with problem size S = N x N x N.
[figure: runtime vs. problem size S]

Can anybody tell me, given the information I've provided:

  1. whether it's scaling well, and/or
  2. how I could go about finding out if it's utilizing all the resources available to it?

Thanks in advance.

EDIT:

Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, same processor). It looks like this:
[figure: GFLOP/s vs. problem size, 24 and 48 cores]

And now the strong and weak scaling plots:
[figure: strong scaling]
[figure: weak scaling]

Note: I've measured the BW and it turns out to be 105 GB/s.

Question: The meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clear this up?


Comments (1)

三人与歌 2025-02-02 02:54:39


Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance; beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.

Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.

Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.

A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance until, as I indicated, you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for the cache effects I indicated.

So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200 GB/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
