Measuring NUMA (non-uniform memory access). No observable asymmetry. Why?
I have tried to measure the asymmetric memory access effects of NUMA, and failed.
The Experiment
Performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.
On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node using numa_alloc_local. I then iterate over array x 50 times, reading and writing every byte in it, and measure the time taken for those 50 iterations.
Then, on each of the other cores in my server, I pin a new thread and again measure the time taken for 50 iterations of reading and writing every byte of array x.
Array x is large in order to minimize cache effects: we want to measure the speed when the CPU has to go all the way to RAM to load and store, not when the caches are helping.
There are two NUMA nodes in my server, so I would expect the cores that have affinity to the node on which array x is allocated to show faster reads/writes. I'm not seeing that.
Why?
Perhaps NUMA is only relevant on systems with more than 8-12 cores, as I've seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
numatest.cpp
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>
// Pin the calling thread to the given core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Print a libnuma bitmask as a string of 0s and 1s.
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocate N bytes on the NUMA node local to `core`, then time M read/write
// passes over the buffer and hand the buffer back through *x.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    void* y = numa_alloc_local(N);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;
    *x = y;
}

// Time M read/write passes over the already-allocated buffer x from `core`.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    // numa_available() must be checked before any other libnuma call.
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Show which cores belong to which NUMA node, and each node's size.
    int numcpus = numa_num_task_cpus();
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    // Allocate and time on core 0 first, then time the same buffer from every core.
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();
    for (size_t i(0); i < (size_t)numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }
    numa_free(x, N);
    return 0;
}
Output
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest
numa_available() 0 <-- NUMA is available on this system
numa node 0 10101010 12884901888 <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064 <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
No matter which core is doing the reading and writing, 50 iterations of reads and writes over array x take about 1.7 seconds.
Update:
The cache on my CPU is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I tried issuing a full memory fence with __sync_synchronize() inside the innermost loop. It still does not reveal any asymmetry between the NUMA nodes.
Update 2:
I tried reading and writing array x with __sync_fetch_and_add(). Still nothing.
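That is, the plain read/write in the inner loop was replaced with something like:

    __sync_fetch_and_add(((char*)x) + j, 1);   // atomic read-modify-write of byte j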
Answers (6)
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that.
Also, you should have 16 threads due to HT. (unless you disabled it)
Another thing:
The socket 1366 Xeon machines are only slightly NUMA. So it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.
On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency. Try randomly accessing a 1 GB block instead of streaming it sequentially.
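A pointer-chasing loop is one way to turn this into a latency measurement; a minimal sketch (assuming the same libnuma/Boost environment as the question's code, with dependent loads so the prefetcher cannot help):

#include <numa.h>
#include <algorithm>
#include <iostream>
#include <vector>
#include <boost/date_time/posix_time/posix_time.hpp>

int main()
{
    const size_t N = 1ull << 30;                 // 1 GB buffer on the local node
    const size_t slots = N / sizeof(size_t);
    size_t* buf = static_cast<size_t*>(numa_alloc_local(N));

    // Build one random cycle through the buffer: buf[i] holds the next index.
    std::vector<size_t> order(slots);
    for (size_t i = 0; i < slots; ++i) order[i] = i;
    std::random_shuffle(order.begin() + 1, order.end());   // keep 0 as the start
    for (size_t i = 0; i + 1 < slots; ++i) buf[order[i]] = order[i + 1];
    buf[order[slots - 1]] = order[0];

    // Each load depends on the previous one, so time per hop approximates
    // memory latency rather than bandwidth.
    const size_t hops = 50 * 1000 * 1000;
    size_t p = 0;
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    for (size_t i = 0; i < hops; ++i) p = buf[p];
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "ns per hop: "
              << double((t2 - t1).total_nanoseconds()) / hops
              << " (checksum " << p << ")" << std::endl;
    numa_free(buf, N);
    return 0;
}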
Last thing:
Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:
Something like this will guarantee that it won't be eliminated by the compiler:
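For instance, a read-modify-write with a visible side effect, along these lines:

    ((char*)x)[j] += 1;   // the store depends on the load and the result is observable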
Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.
If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
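The barrier variant's inner loop was along these lines:

    c = ((char*)x)[j];
    ((char*)x)[j] = c;
    __sync_synchronize();   // full memory fence after every byte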
but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:
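That is, an access pattern along these lines:

    ((char*)x)[(j * 1009) % N] += 1;   // prime stride defeats the hardware prefetcher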
With that change, the NUMA asymmetry is clearly revealed.
At least I think that's what's going on.
Thanks Mysticial!
EDIT: CONCLUSION ~133%
For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.
Thanks for this benchmark code. I've taken your 'fixed' version and changed it to pure C + OpenMP and added a few tests for how the memory system behaves under contention. You can find the new code here.
Here are some sample results from a Quad Opteron:
If someone has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (likely off by a--hopefully constant--integer factor).
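For reference, the contention idea boils down to something like the following (a rough C++/OpenMP sketch, not the actual code behind the link above):

// build: g++ -O2 -fopenmp contention.cpp -lnuma
#include <numa.h>
#include <omp.h>
#include <pthread.h>
#include <cstdio>

int main()
{
    const size_t N = 100 * 1000 * 1000;                  // 100 MB buffer on node 0
    const char* buf = (const char*)numa_alloc_onnode(N, 0);

    #pragma omp parallel
    {
        // Pin each OpenMP thread to the core matching its thread number.
        cpu_set_t cs;
        CPU_ZERO(&cs);
        CPU_SET(omp_get_thread_num(), &cs);
        pthread_setaffinity_np(pthread_self(), sizeof(cs), &cs);

        #pragma omp barrier                              // start everyone together
        long sum = 0;
        double t0 = omp_get_wtime();
        for (size_t j = 0; j < N; ++j)
            sum += buf[j];                               // all threads stream the same node-0 buffer
        double t1 = omp_get_wtime();
        std::printf("thread %d: %.0f MB/s (sum %ld)\n",
                    omp_get_thread_num(), (N / 1e6) / (t1 - t0), sum);
    }

    numa_free((void*)buf, N);
    return 0;
}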
A few comments:
- Have a look at the lstopo utility from the hwloc library. In particular, you'll see which core numbers are members of which NUMA node (processor socket).
- char is probably not the ideal data type to measure the maximum RAM throughput. I suspect that using a 32-bit or 64-bit data type, you can get more data through with the same number of CPU cycles.
- More generally, you should also check that your measurement is not limited by the CPU speed but by the RAM speed. The ramspeed utility, for example, unrolls the inner loop explicitly to some extent in the source code (see the sketch after this list). EDIT: on supported architectures, ramsmp actually even uses 'hand written' assembly code for these loops.
- L1/L2/L3 cache effects: It is instructive to measure the bandwidth in GByte/s as a function of the block size. You should see roughly four different speeds as you increase the block size, corresponding to where you are reading the data from (caches or main memory). Your processor seems to have 8 MByte of Level 3 (?) cache, so your 10 million bytes might mostly stay in the L3 cache (which is shared among all cores of one processor).
- Memory channels: Your processor has 3 memory channels. If your memory banks are installed such that you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I saw effects where, when reading with only one thread, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads, the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.
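As a sketch of the unrolling idea (not the actual ramspeed source), reading the buffer as 64-bit words, eight per iteration, looks roughly like this; x and N are the same as in numatest.cpp above, plus <stdint.h> for uint64_t:

    // Wider loads plus manual unrolling, so loop overhead and narrow
    // char accesses stop being the bottleneck.
    uint64_t sum = 0;
    const uint64_t* p = (const uint64_t*)x;
    const size_t words = N / sizeof(uint64_t);
    for (size_t j = 0; j + 8 <= words; j += 8)
    {
        sum += p[j]     + p[j + 1] + p[j + 2] + p[j + 3]
             + p[j + 4] + p[j + 5] + p[j + 6] + p[j + 7];
    }
    std::cout << sum << std::endl;   // keep the reads observable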
You can also use numactl to choose which node to run the process on and where to allocate memory from:
I use this combined with LMbench to get memory latency numbers:
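For example (the lat_mem_rd arguments, array size in MB and stride in bytes, are illustrative):

    numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 512 128   # run on node 0, memory on node 0 (local)
    numactl --cpunodebind=0 --membind=1 ./lat_mem_rd 512 128   # run on node 0, memory on node 1 (remote)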
If anyone else wants to try this test, here is the modified, working program. I would love to see results from other hardware. This Works On My Machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.
numatest.cpp
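The essential change relative to the original listing above is the access pattern in both timing loops; a sketch of the modified thread2 (thread1 changes the same way):

    // Same shape as the original thread2, but with the prefetch-defeating,
    // side-effecting access pattern.
    void thread2(void* x, size_t core, size_t N, size_t M)
    {
        pin_to_core(core);
        boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
        for (size_t i = 0; i < M; ++i)
            for (size_t j = 0; j < N; ++j)
                *(((char*)x) + ((j * 1009) % N)) += 1;   // prime stride, visible write
        boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
        std::cout << "Elapsed read/write by thread on core " << core
                  << ": " << (t2 - t1) << std::endl;
    }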