Performance optimization of linear combination of large matrices in C++?
I have a large tensor of floating-point data with dimensions 35k (rows) x 45 (cols) x 150 (slices), which I have stored in an Armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating-point weights are also stored in an Armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.

I have tried multiple different ways of performing this linear combination, but the following code seems to be the fastest I can get (70 ms), as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.

Please note that Armadillo stores data in column-major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then the third, and so forth.
#include <armadillo>
#include <chrono>
#include <iostream>

typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int main() {
    const int rows = 35000;
    const int cols = 45;
    const int slices = 150;

    // 35000 x 45 x 150 tensor of random floats, plus one weight per slice
    arma::fcube tensor(rows, cols, slices, arma::fill::randu);
    arma::fvec w(slices, arma::fill::randu);

    double overallTime = 0;
    const int window = 30;  // number of timed frames to average over

    for (int n = 0; n < window; n++) {
        Timer::time_point start = Timer::now();

        // Weighted sum of all slices, accumulated into a rows x cols matrix
        arma::fmat result(rows, cols, arma::fill::zeros);
        for (int i = 0; i < slices; i++)
            result += tensor.slice(i) * w(i);

        Timer::time_point end = Timer::now();
        Duration span = end - start;
        double t = span.count();
        overallTime += t;
        std::cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << std::endl;
    }

    std::cout << std::endl << "average time = " << overallTime * 1000.0 / window << " ms" << std::endl;
    return 0;
}
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.
3 Answers
First of all I need to admit that I'm not familiar with the arma framework or its memory layout; least of all whether the syntax result += slice(i) * weight evaluates lazily.

The two primary problems, and their solutions, anyway lie in the memory layout and the memory-to-arithmetic computation ratio.
To say a += b*c is problematic, because it needs to read b and a, write a, and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).

If the memory layout is of the form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels, and it should be expressed as such.

If it's float tensor[c][h][w], it's better to unroll the loop to result += w(i)*slice(i) + w(i+1)*slice(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.

It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices), where N < 16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
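A minimal sketch of that blocked accumulation, assuming Armadillo's layout where each slice is one contiguous rows*cols block of floats; the function name, the block size of 16, and the raw-pointer access are illustrative, not part of the original answer:

#include <armadillo>

// Process the output in contiguous blocks of B floats, streaming through all
// slices while the B partial sums stay in (SIMD) registers.
void weighted_sum_blocked(const arma::fcube& tensor, const arma::fvec& w, arma::fmat& result)
{
    const arma::uword n = tensor.n_rows * tensor.n_cols;  // elements per slice
    const arma::uword slices = tensor.n_slices;
    constexpr arma::uword B = 16;                          // 4*N accumulators, N = 4

    const float* base = tensor.memptr();
    float* out = result.memptr();                          // result must be rows x cols

    arma::uword j = 0;
    for (; j + B <= n; j += B) {
        float acc[B] = {0.0f};
        for (arma::uword i = 0; i < slices; ++i) {
            const float* s = base + i * n + j;             // B contiguous floats of slice i
            const float wi = w(i);
            for (arma::uword k = 0; k < B; ++k)
                acc[k] += wi * s[k];                       // vectorizable multiply-accumulate
        }
        for (arma::uword k = 0; k < B; ++k)
            out[j + k] = acc[k];
    }
    for (; j < n; ++j) {                                   // scalar tail (n not a multiple of B)
        float acc = 0.0f;
        for (arma::uword i = 0; i < slices; ++i)
            acc += w(i) * base[i * n + j];
        out[j] = acc;
    }
}

Compiled with -O3 (and optionally -ffast-math), the inner k-loop is the kind of pattern compilers tend to turn into packed fused multiply-add instructions.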
There's a possibility of a minor improvement by padding the slice count to a multiple of 4 or 8, by compiling with -ffast-math to enable fused multiply-accumulate (if available), and with multithreading.

The constraints indicate the need to perform 13.5 GFLOPS, which is a reasonable number in terms of arithmetic (for many modern architectures), but it also means at least 54 GB/s of memory bandwidth, which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
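As a back-of-the-envelope check of those numbers: one frame needs 35000 * 45 * 150 ≈ 236 million multiply-adds, i.e. roughly 0.47 GFLOP, and 0.47 GFLOP / 0.035 s ≈ 13.5 GFLOPS. The tensor itself is 236 million floats ≈ 945 MB, so merely streaming it once per frame already requires about 27 GB/s; the per-slice read-modify-write of result in the naive loop pushes the total traffic well beyond that.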
EDIT

Knowing the memory order to be float tensor[150][45][35000], or float tensor[kSlices][kRows * kCols == kCols * kRows], suggests to me to try first unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4, requiring a special case for the excess) streams.

EDIT 2
Another starting point (before adding multithreading) would be to consider changing the layout so that the slice index becomes the fastest-moving dimension, i.e. something like float tensor[kRows * kCols][kSlices].

The downside now is that kSlices = 150 can't fit all the weights in registers any more (and secondly, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.

The upside is that the reduction no longer needs to go through memory, which is a big thing with the added multithreading.
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile using vec4 = float __attribute__((vector_size(16))); to efficient SIMD code.
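A minimal sketch of the kind of kernel being described, assuming the transposed float tensor_t[kRowsCols][kSlices] layout and GCC/Clang vector extensions; the names (vec8, linear_combine_transposed) and the OpenMP pragma are illustrative additions, not the answerer's code:

#include <cstring>

using vec8 = float __attribute__((vector_size(32)));    // 8 packed floats (AVX-sized)

// tensor_t: kRowsCols * kSlices floats, element (p, s) at tensor_t[p * kSlices + s]
// w:        kSlices weights;  out: kRowsCols results
void linear_combine_transposed(const float* tensor_t, const float* w,
                               float* out, long kRowsCols, int kSlices)
{
    const int vecEnd = (kSlices / 8) * 8;                // 150 -> 144, 6-element scalar tail
    #pragma omp parallel for                             // optional: compile with -fopenmp
    for (long p = 0; p < kRowsCols; ++p) {
        const float* row = tensor_t + p * kSlices;       // all slices of one output element
        vec8 acc = {0, 0, 0, 0, 0, 0, 0, 0};
        for (int s = 0; s < vecEnd; s += 8) {
            vec8 v, wv;
            std::memcpy(&v,  row + s, sizeof v);         // unaligned 8-float loads
            std::memcpy(&wv, w + s,   sizeof wv);
            acc += v * wv;                               // element-wise multiply-accumulate
        }
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k) sum += acc[k];       // horizontal reduction of 8 partial sums
        for (int s = vecEnd; s < kSlices; ++s)           // tail slices (kSlices not a multiple of 8)
            sum += row[s] * w[s];
        out[p] = sum;
    }
}

The weights are re-loaded for every output element here, which is the cost of not fitting all 150 of them in registers; the partial sums, on the other hand, never leave registers until the final horizontal add.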
As @hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
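A minimal, illustrative sketch of a change of this kind, assuming a 4-way unrolled slice loop so that result is read and written once per four slices rather than once per slice; this is a guess, not necessarily the exact code that produced the 45 ms figure:

// Inside the timed frame, replacing the original slice loop:
arma::fmat result(rows, cols, arma::fill::zeros);
int i = 0;
for (; i + 3 < slices; i += 4) {                 // main unrolled loop (covers 148 of the 150 slices)
    result += tensor.slice(i)     * w(i)
            + tensor.slice(i + 1) * w(i + 1)
            + tensor.slice(i + 2) * w(i + 2)
            + tensor.slice(i + 3) * w(i + 3);
}
for (; i < slices; ++i)                          // remaining slices (150 % 4 == 2)
    result += tensor.slice(i) * w(i);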
Without having the actual code, I'm guessing that result += tensor.slice(i) * w(i) creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function

addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )

which translates to pure loops over the elements:
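A minimal sketch of such an addto helper over raw element arrays, assuming each slice is a contiguous block of n = rows * cols floats; the signature and the fixed unroll factor of 4 are illustrative:

#include <cstddef>

// lhs[k] += w1*s1[k] + w2*s2[k] + w3*s3[k] + w4*s4[k] for all k:
// one pure loop over the elements, no temporary matrices, lhs touched once.
void addto(float* lhs, std::size_t n,
           const float* s1, float w1, const float* s2, float w2,
           const float* s3, float w3, const float* s4, float w4)
{
    for (std::size_t k = 0; k < n; ++k)
        lhs[k] += w1 * s1[k] + w2 * s2[k] + w3 * s3[k] + w4 * s4[k];
}

With Armadillo, the pointers can be obtained via result.memptr() and tensor.slice(i).memptr(), with n = rows * cols.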
It would surprise me if that doesn't buy you an extra factor.