我应该更喜欢跨一步内存访问来进行读取还是写入?
众所周知,以跨步方式访问内存最有利于性能。
的情况下
- 在我必须访问一个内存区域进行读取、
- 必须访问另一区域进行写入
- ,我只能以跨步一方式访问这两个区域之一,
我应该更喜欢读取跨步一还是写入跨步一?
一个简单、具体的示例是类似 BLAS 的复制和置换操作,如 y := P x 。置换矩阵 P
完全由某个置换向量 q(i)
定义。它有一个相应的逆排列向量qinv(i)
。可以将所需的循环编码为 y[qinv(i)] = x[i]
或 y[i]=x[q(i)]
,其中前者从 x
stride one 读取,后者写入 y
stride one。
理想情况下,人们总是可以对两种可能性进行编码,在代表性条件下对它们进行分析,并选择更快的版本。假设您只能编写一个版本 - 根据现代内存架构的行为,您总是预计哪种访问模式会更快?在线程环境中工作会改变您的反应吗?
It's well known that accessing memory in a stride one fashion is best for performance.
In situations where
- I must access one region of memory for reading,
- I must access another region for writing, and
- I may only access one of the two regions in a stride one fashion,
should I prefer reading stride one or writing stride one?
One simple, concrete example is a BLAS-like copy-and-permute operation like y := P x
. The permutation matrix P
is defined entirely by some permutation vector q(i)
. It has a corresponding inverse permutation vector qinv(i)
. One could code the required loop as y[qinv(i)] = x[i]
or as y[i]=x[q(i)]
where the former reads from x
stride one and the latter writes to y
stride one.
Ideally one could always code both possibilities, profile them under representative conditions, and choose the faster version. Pretend you could only code one version-- which access pattern would you always anticipate being faster based on the behavior of modern memory architectures? Does working in a threaded environment change your response?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您将其命名为“写入跨步一”(
y[i]=x[q(i)]
) 的访问模式通常会更快。如果内存被缓存并且您的数据块小于缓存线,则这种访问模式需要更少的内存带宽。
现代处理器通常具有比存储单元更多的加载执行单元。下一代英特尔架构名为 Haswell,仅支持 GATHER 指令,而 SCATTER 尚未包含在他们的计划中。这一切也有利于“写出一步”的模式。
在线程环境中工作不会改变这一点。
Access pattern, that you name "writes stride one" (
y[i]=x[q(i)]
), is usually faster.If memory is cached and your data pieces are smaller than cache line, this access pattern requires less memory bandwidth.
It is usual for modern processors to have more load execution units, than store units. And next Intel architecture, named Haswell, supports only GATHER instruction, while SCATTER is not yet in their plans. All this is also in favor of "writes stride one" pattern.
Working in a threaded environment does not change this.
我想分享我的简单基准测试的结果。假设我们有两个
NxN
方阵A
和B
,均为double
,并且我们想要执行复制转置:算法:
mkl_domatcopy
。没有转置的副本用作基线。
N
的值取为2^K + 1
以减轻缓存关联性影响。搭载 GCC 8.3.0 (
-O3 -m64 -march=native
) 和英特尔 MKL 2019.0.1 的英特尔酷睿 i7-4770:搭载 GCC 7.3.0 (
-O3 -m64 -march=native
) 和英特尔 MKL 2017.0.1 的英特尔至强 E5-2650 v3:数字和 C++ 源代码
I'd like to share results of my simple benchmarks. Suppose we have two square
NxN
matricesA
andB
ofdouble
s, and we want to perform a copy with a transposition:Algorithms:
mkl_domatcopy
.Copy without transposition is used as a baseline. Values of
N
are taken to be2^K + 1
to mitigate cache associativity effects.Intel Core i7-4770 with GCC 8.3.0 (
-O3 -m64 -march=native
) and Intel MKL 2019.0.1:Intel Xeon E5-2650 v3 with GCC 7.3.0 (
-O3 -m64 -march=native
) and Intel MKL 2017.0.1:Numbers and C++ source code