Can the STREAM and GUPS (single CPU) benchmarks use non-local memory on NUMA machines?

Posted 2024-08-26 05:03:42 · 1526 words · 8 views · 0 comments

I want to run some tests from HPCC, STREAM and GUPS.

They will test memory bandwidth, latency, and throughput (in terms of random accesses).

Can I run the single-CPU STREAM test or the single-CPU GUPS test on a NUMA node with memory interleaving enabled? (Is this allowed by the rules of HPCC, the High Performance Computing Challenge?)

Using non-local memory can increase GUPS results, because it increases the number of memory banks available for random accesses 2- or 4-fold. (GUPS is typically limited by a non-ideal memory subsystem and by slow memory bank opening/closing. With more banks, it can update one bank while the other banks are opening or closing.)

Thanks.

UPDATE:

(you may not reorder the memory accesses that the program makes).

But can the compiler reorder the loop nest? E.g., in hpcc/RandomAccess.c:

  /* Perform updates to main table.  The scalar equivalent is:
   *
   *     u64Int ran;
   *     ran = 1;
   *     for (i=0; i<NUPDATE; i++) {
   *       ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0);
   *       table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)];
   *     }
   */
  for (j=0; j<128; j++)
    ran[j] = starts ((NUPDATE/128) * j);
  for (i=0; i<NUPDATE/128; i++) {
/* #pragma ivdep */
    for (j=0; j<128; j++) {
      ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
      Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
    }
  }

The outer loop here is for (i=0; i<NUPDATE/128; i++) { and the nested loop is for (j=0; j<128; j++) {. Using the 'loop interchange' optimization, a compiler can convert this code to

  for (j=0; j<128; j++) {
    for (i=0; i<NUPDATE/128; i++) {
      ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
      Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
    }
  }

This is possible because the loop nest is a perfect loop nest. Is such an optimization prohibited by the HPCC rules?

Comments (1)

三五鸿雁 2024-09-02 05:03:42

As far as I can tell, it is allowed, given that memory interleaving
is a system setting rather than a code modification (you may not reorder
the memory accesses that the program makes).

Whether GUPS actually gets better performance with non-local memory on a
NUMA machine seems doubtful to me. Will bank conflict-induced latency
really be greater than the off-node memory access latency?

STREAM should not be limited by bank conflicts but will probably
benefit from off-node accesses if the CPU has an on-chip memory
controller (like the Opterons) since the bandwidth is then shared
between the local memory controller and the NUMA interconnect.
