测量 NUMA(非统一内存访问)。没有可观察到的不对称性。为什么?

发布于 2024-12-02 02:41:03 字数 4714 浏览 9 评论 0 原文

我尝试测量 NUMA 的不对称内存访问效果,但失败了。

实验

在 Intel Xeon X5570 @ 2.93GHz、2 个 CPU、8 核上进行。

在固定到核心 0 的线程上,我使用 numa_alloc_local 在核心 0 的 NUMA 节点上分配大小为 10,000,000 字节的数组 x。 然后我迭代数组 x 50 次并读取和写入数组中的每个字节。测量进行 50 次迭代所用的时间。

然后,在服务器中的每个其他核心上,我固定一个新线程,并再次测量进行 50 次读写迭代所用的时间 到数组 x 中的每个字节。

数组x很大,可以最大限度地减少缓存影响。我们想要测量 CPU 必须一直到 RAM 进行加载和存储时的速度,而不是缓存发挥作用时的速度。

我的服务器中有两个 NUMA 节点,因此我希望在分配数组 x 的同一节点上具有亲和力的核心具有 更快的读/写速度。我没有看到这一点。

为什么?

也许 NUMA 仅与具有 > 的系统相关。 8-12 核,正如我在其他地方看到的那样?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

输出

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

执行情况无论哪个核心正在执行读取和写入操作,对数组 x 进行 50 次迭代读取和写入大约需要 1.7 秒。

更新:

我的 CPU 上的缓存大小为 8Mb,因此 10Mb 数组 x 可能不足以消除缓存效应。我尝试了 100Mb 数组 x,并且 我尝试在最里面的循环中使用 __sync_synchronize() 发出完整的内存栅栏。它仍然没有揭示 NUMA 节点之间的任何不对称性。

更新 2:

我尝试使用 __sync_fetch_and_add() 读取和写入数组 x 。还是什么都没有。

I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The Experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local.
Then I iterate over array x 50 times and read and write each byte in the array. Measure the elapsed time to do the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing
to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have
faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

The Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8Mb, so maybe 10Mb array x is not big enough to eliminate cache effecs. I tried 100Mb array x, and
I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

清欢 2024-12-09 02:41:03

我要指出的第一件事是,您可能需要仔细检查每个节点上有哪些核心。我不记得核心和节点是这样交错的。
另外,由于 HT,您应该有 16 个线程。 (除非你禁用它)

另一件事:

socket 1366 Xeon 机器只有轻微的 NUMA。所以很难看出区别。 NUMA 效果在 4P Opterons 上更加明显。

在像您这样的系统上,节点到节点的带宽实际上比 CPU 到内存的带宽更快。由于您的访问模式是完全顺序的,因此无论数据是否是本地的,您都可以获得完整的带宽。更好的衡量指标是延迟。尝试随机访问 1 GB 的块,而不是按顺序流式传输。

最后一件事:

根据编译器优化的积极程度,您的循环可能会被优化,因为它不执行任何操作:

c = ((char*)x)[j];
((char*)x)[j] = c;

这样的事情将保证它不会被编译器消除:

((char*)x)[j] += 1;

The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that.
Also, you should have 16 threads due to HT. (unless you disabled it)

Another thing:

The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.

On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency. Try random accessing a block of 1 GB instead of streaming it sequentially.

Last thing:

Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:

c = ((char*)x)[j];
((char*)x)[j] = c;

Something like this will guarantee that it won't be eliminated by the compiler:

((char*)x)[j] += 1;
古镇旧梦 2024-12-09 02:41:03

啊哈!玄妙说得对!不知何故,硬件预取正在优化我的读/写。

如果它是缓存优化,那么强制内存屏障将使优化失效:

c = __sync_fetch_and_add(((char*)x) + j, 1);

但这没有任何区别。真正起作用的是将我的迭代器索引乘以素数 1009 以击败预取优化:

*(((char*)x) + ((j * 1009) % N)) += 1;

通过这一更改,NUMA 不对称性清晰可见:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

至少我认为这就是正在发生的事情。

谢谢神秘!

编辑:结论 ~133%

对于那些只是浏览这篇文章以粗略了解 NUMA 性能特征的人来说,这是根据我的测试得出的底线:

对非本地 NUMA 节点的内存访问延迟约为本地节点的 1.33 倍。

Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.

漫漫岁月 2024-12-09 02:41:03

感谢这个基准代码。我已经采用了您的“固定”版本,并将其更改为纯 C + OpenMP,并添加了一些测试来了解内存系统在争用情况下的行为。您可以在此处找到新代码。

以下是 Quad Opteron 的一些示例结果:

num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB

sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s

all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s

two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s

two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s

two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s

two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s

two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s

two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s

two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s

two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s

two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s

two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s

two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s

two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s

two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s

two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s

two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s

two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s

two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s

two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s

two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s

two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s

two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s

two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s

two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s

two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s

two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s

two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s

two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s

two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s

two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s

two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s

two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s

如果有人有进一步的改进,我很高兴听到他们的消息。例如,这些显然不是现实世界单位中完美的带宽测量(可能会偏离一个——希望是恒定的——整数因子)。

Thanks for this benchmark code. I've taken your 'fixed' version and changed it to pure C + OpenMP and added a few tests for how the memory system behaves under contention. You can find the new code here.

Here are some sample results from a Quad Opteron:

num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB

sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s

all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s

two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s

two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s

two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s

two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s

two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s

two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s

two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s

two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s

two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s

two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s

two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s

two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s

two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s

two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s

two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s

two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s

two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s

two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s

two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s

two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s

two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s

two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s

two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s

two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s

two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s

two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s

two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s

two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s

two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s

two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s

two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s

If someone has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (likely off by a--hopefully constant--integer factor).

随风而去 2024-12-09 02:41:03

一些评论:

  • 要查看系统的 NUMA 结构(在 Linux 上),您可以使用 lstopo 实用程序获得图形概述.org/projects/hwloc/" rel="noreferrer">hwloc 库。特别是,您将看到哪些核心编号是哪个 NUMA 节点(处理器套接字)的成员
  • char 可能不是测量最大 RAM 吞吐量的理想数据类型。我怀疑使用 32 位或 64 位数据类型,您可以使用相同数量的 cpu 周期获取更多数据。
  • 更一般地说,您还应该检查您的测量是否不受 CPU 速度的限制,而是受 RAM 速度的限制。例如, ramspeed 实用程序在源代码中在某种程度上显式展开内部循环:

    for(i = 0; i < blk/sizeof(UTL); i += 32) {
        b[i] = a[i]; b[i+1] = a[i+1];
        ...
        b[i+30] = a[i+30]; b[i+31] = a[i+31];
    }
    

    编辑:在支持的体系结构上ramsmp实际上甚至为这些循环使用“手写”汇编代码

  • L1/L2/L3 缓存效果:以 GByte/s 为单位测量带宽很有启发性作为块大小的函数。当增加与读取数据的位置(缓存或主内存)相对应的块大小时,您应该会看到大约四种不同的速度。您的处理器似乎有 8 MB 的 Level3(?) 缓存,因此您的 1000 万字节可能主要保留在 L3 缓存中(在一个的所有核心之间共享)

  • 内存通道:您的处理器有3 个内存通道。如果您安装的内存条可以充分利用它们(例如,参见主板手册),您可能希望同时运行多个线程。我看到的效果是,当仅使用一个线程读取时,渐近带宽接近于单个内存模块(例如DDR-1600的12.8 GByte/s),而当运行多个线程时,渐近带宽接近于数量内存通道数乘以单个内存模块的带宽。

a few comments:

  • to see the NUMA structure of your system (on linux), you can get a graphical overview using the lstopo utility from the hwloc library. In particular, you'll see which core numbers are members of which NUMA node (processor socket)
  • char is probably not the ideal data type to measure the maximum RAM throughput. I suspect that using a 32 bit or 64 bit data type, you can get more data through with the same number of cpu cycles.
  • More generally, you should also check that your measurement is not limited by the CPU speed but by the RAM speed. The ramspeed utility for example unrolls the inner loop explicitly to some extent in the source code:

    for(i = 0; i < blk/sizeof(UTL); i += 32) {
        b[i] = a[i];        b[i+1] = a[i+1];
        ...
        b[i+30] = a[i+30];  b[i+31] = a[i+31];
    }
    

    EDIT: on supported architectures ramsmp actually even uses 'hand written' assembly code for these loops

  • L1/L2/L3 Cache effects: It is instructive to measure the bandwidth in GByte/s as function of the block size. You should see that roughly four different speeds when increasing the block size corresponding to where you are reading the data from (caches or main memory). Your processor seems to have 8 MByte of Level3 (?) cache, so your 10 Million bytes might just mostly stay in the L3 cache (which is shared among all cores of one processor).

  • Memory channels: Your processor has 3 memory channels. If your memory banks are installed such you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I saw effects that when reading with one thread only, the asymptotic bandwidth is close to the one of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads, the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.

影子的影子 2024-12-09 02:41:03

您还可以使用 numactl 来选择在哪个节点上运行进程以及从哪里分配内存:

numactl --cpubind=0 --membind=1 <process>

我将其与 LMbench 获取内存延迟数字:

numactl --cpubind=0 --membind=0  ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1  ./lat_mem_rd -t 512

You can also you use numactl to choose which node to run the process on and where to allocate memory from:

numactl --cpubind=0 --membind=1 <process>

I use this combined with LMbench to get memory latency numbers:

numactl --cpubind=0 --membind=0  ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1  ./lat_mem_rd -t 512
苍景流年 2024-12-09 02:41:03

如果其他人想尝试这个测试,这里是修改后的工作程序。我很想看到其他硬件的结果。这适用于我的 Linux 2.6.34-12-desktop、GCC 4.5.0、Boost 1.47 机器。

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(5);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

If anyone else wants to try this test, here is the modified, working program. I would love to see results from other hardware. This Works On My Machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(5);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文