我应该更喜欢跨一步内存访问来进行读取还是写入?

发布于 2024-12-29 12:41:03 字数 544 浏览 1 评论 0原文

众所周知,以跨步方式访问内存最有利于性能。

的情况下

  • 在我必须访问一个内存区域进行读取、
  • 必须访问另一区域进行写入
  • ,我只能以跨步一方式访问这两个区域之一,

我应该更喜欢读取跨步一还是写入跨步一?

一个简单、具体的示例是类似 BLAS 的复制和置换操作,如 y := P x 。置换矩阵 P 完全由某个置换向量 q(i) 定义。它有一个相应的逆排列向量qinv(i)。可以将所需的循环编码为 y[qinv(i)] = x[i]y[i]=x[q(i)],其中前者从 x stride one 读取,后者写入 y stride one。

理想情况下,人们总是可以对两种可能性进行编码,在代表性条件下对它们进行分析,并选择更快的版本。假设您只能编写一个版本 - 根据现代内存架构的行为,您总是预计哪种访问模式会更快?在线程环境中工作会改变您的反应吗?

It's well known that accessing memory in a stride one fashion is best for performance.

In situations where

  • I must access one region of memory for reading,
  • I must access another region for writing, and
  • I may only access one of the two regions in a stride one fashion,

should I prefer reading stride one or writing stride one?

One simple, concrete example is a BLAS-like copy-and-permute operation like y := P x. The permutation matrix P is defined entirely by some permutation vector q(i). It has a corresponding inverse permutation vector qinv(i). One could code the required loop as y[qinv(i)] = x[i] or as y[i]=x[q(i)] where the former reads from x stride one and the latter writes to y stride one.

Ideally one could always code both possibilities, profile them under representative conditions, and choose the faster version. Pretend you could only code one version-- which access pattern would you always anticipate being faster based on the behavior of modern memory architectures? Does working in a threaded environment change your response?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

━╋う一瞬間旳綻放 2025-01-05 12:41:03

您将其命名为“写入跨步一”(y[i]=x[q(i)]) 的访问模式通常会更快。

如果内存被缓存并且您的数据块小于缓存线,则这种访问模式需要更少的内存带宽。

现代处理器通常具有比存储单元更多的加载执行单元。下一代英特尔架构名为 Haswell,仅支持 GATHER 指令,而 SCATTER 尚未包含在他们的计划中。这一切也有利于“写出一步”的模式。

在线程环境中工作不会改变这一点。

Access pattern, that you name "writes stride one" (y[i]=x[q(i)]), is usually faster.

If memory is cached and your data pieces are smaller than cache line, this access pattern requires less memory bandwidth.

It is usual for modern processors to have more load execution units, than store units. And next Intel architecture, named Haswell, supports only GATHER instruction, while SCATTER is not yet in their plans. All this is also in favor of "writes stride one" pattern.

Working in a threaded environment does not change this.

滥情稳全场 2025-01-05 12:41:03

我想分享我的简单基准测试的结果。假设我们有两个 NxN 方阵 AB,均为 double,并且我们想要执行复制转置:

A = transpose(B)

算法:

  1. 两个嵌套循环,使得读取是连续的并且写入是跨步的。
  2. 两个嵌套循环,使得读取是跨步的并且写入是连续的。
  3. 顺序 MKL 的 mkl_domatcopy

没有转置的副本用作基线。 N 的值取为 2^K + 1 以减轻缓存关联性影响。

搭载 GCC 8.3.0 (-O3 -m64 -march=native) 和英特尔 MKL 2019.0.1 的英特尔酷睿 i7-4770:

英特尔酷睿 i7-4770

搭载 GCC 7.3.0 (-O3 -m64 -march=native) 和英特尔 MKL 2017.0.1 的英特尔至强 E5-2650 v3:

英特尔至强E5-2650 v3

数字和 C++ 源代码

I'd like to share results of my simple benchmarks. Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:

A = transpose(B)

Algorithms:

  1. Two nested loops such that reads are contiguous and writes are strided.
  2. Two nested loops such that reads are strided and writes are contiguous.
  3. Sequential MKL's mkl_domatcopy.

Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects.

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:

Intel Core i7-4770

Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:

Intel Xeon E5-2650 v3

Numbers and C++ source code

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文