Performance optimization of linear combinations of large matrices in C++?

Posted on 2025-01-23 11:06:51


I have a large tensor of floating-point data with dimensions 35k (rows) x 45 (cols) x 150 (slices), which I have stored in an Armadillo cube container. I need to linearly combine all 150 slices together in under 35 ms (a must for my application). The floating-point weights for the linear combination are also stored in an Armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note that I'm allowed to use CPU parallelism but not the GPU.

I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.

Please note that Armadillo stores data in column-major format. So within the tensor, it first stores the columns of the first channel, then the columns of the second channel, then the third, and so forth.
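For example (a small illustrative check that is not part of the original post), the element at (row r, col c, slice s) of a column-major cube sits at the flat offset s*n_rows*n_cols + c*n_rows + r, so each slice is one contiguous block of n_rows*n_cols floats:

#include <armadillo>
#include <cassert>

int main() {
    arma::fcube small_cube(3, 4, 5, arma::fill::randu);  // rows x cols x slices
    const float *base = small_cube.memptr();             // column-major, slice after slice
    // slice 2, column 1, row 0  ->  flat offset 2*(3*4) + 1*3 + 0
    assert(base[2 * (3 * 4) + 1 * 3 + 0] == small_cube(0, 1, 2));
    return 0;
}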

#include <armadillo>
#include <chrono>
#include <iostream>

using namespace std;

typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int main() {
    int rows = 35000;
    int cols = 45;
    int slices = 150;
    arma::fcube tensor(rows, cols, slices, arma::fill::randu);
    arma::fvec w(slices, arma::fill::randu);

    double overallTime = 0;
    int window = 30;
    for (int n = 0; n < window; n++) {

        Timer::time_point start = Timer::now();

        arma::fmat result(rows, cols, arma::fill::zeros);
        for (int i = 0; i < slices; i++)
            result += tensor.slice(i) * w(i);

        Timer::time_point end = Timer::now();
        Duration span = end - start;
        double t = span.count();
        overallTime += t;
        cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
    }

    cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
    return 0;
}

I need to optimize this code by at least 2x and I would very much appreciate any suggestions.

Comments (3)

淡水深流 2025-01-30 11:06:51


First of all, I must admit that I'm not familiar with the arma framework or its memory layout, and least of all whether the expression result += slice(i) * weight is evaluated lazily.

In any case, the two primary problems (and their solutions) lie in the memory layout and the memory-to-arithmetic computation ratio.

a += b*c is problematic because it needs to read b and a, write a, and uses up to two arithmetic operations (two if the architecture does not combine multiplication and accumulation).

If the memory layout is of the form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels, and it should be expressed as such.
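For illustration, a minimal sketch of that formulation (my own example, assuming a flat channels-last array; combine_channels_last and its parameters are not from the original answer):

// result[p] = dot(channels of pixel p, w): one dot product of length `channels` per pixel
void combine_channels_last(const float *tensor, const float *w, float *result,
                           int pixels, int channels) {
    for (int p = 0; p < pixels; ++p) {
        float acc = 0.0f;
        for (int ch = 0; ch < channels; ++ch)
            acc += tensor[p * channels + ch] * w[ch];
        result[p] = acc;
    }
}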

If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i)*w(i) + slice(i+1)*w(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.

It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.

There's a possibility of a minor improvement by padding the slice count to a multiple of 4 or 8, by compiling with -ffast-math to enable fused multiply-add (if available), and by multithreading.

The constraints indicate the need to perform 13.5 GFLOP/s, which is a reasonable number in arithmetic terms (for many modern architectures), but it also implies at least 54 GB/s of memory bandwidth, which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
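A quick back-of-the-envelope check of those numbers (my own arithmetic, assuming one multiply and one add per tensor element within the 35 ms budget):

#include <cstdio>

int main() {
    const double rows = 35000, cols = 45, slices = 150, budget = 0.035;  // seconds
    const double flops = rows * cols * slices * 2;  // one multiply + one add per element
    const double bytes = rows * cols * slices * 4;  // streaming the float tensor once
    std::printf("%.1f GFLOP/s\n", flops / budget / 1e9);                  // ~13.5
    std::printf("%.1f GB/s (tensor read only)\n", bytes / budget / 1e9);  // ~27; roughly
    // doubles once the repeated reads/writes of the accumulator are counted
    return 0;
}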

EDIT

Knowing the memory order to be float tensor[150][45][35000], i.e. float tensor[kSlices][kCols * kRows] with each slice one contiguous block, suggests to me to first try unrolling the outer loop into 4 (or maybe even 5, since 150 is not divisible by 4 and would require a special case for the excess) streams.

void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that the cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;
    // try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
    for (int s = 0; s < 150; s+=5) {
        auto src0 = tensor  + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the innerloop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s+1] +
                       src2[x] * w[s+2] +
                       src3[x] * w[s+3] +
                       src4[x] * w[s+4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}
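A hypothetical call site using the Armadillo containers from the question (my sketch, not the answerer's; it relies on the cube's slices being contiguous with no padding, and on 35000*45 already being a multiple of 4, so the rounding inside blend() is a no-op):

arma::fmat result(rows, cols);  // no need to zero: blend() overwrites when s == 0
blend(cols, rows, tensor.memptr(), result.memptr(), w.memptr());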

EDIT 2

Another starting point (before adding multithreading) would be to consider changing the layout to

float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional

The downside now is that with kSlices = 150 the weights can no longer all fit in registers (and secondly, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.

The upside is that the reduction no longer needs to go through memory, which is a big win once multithreading is added.

void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
     // each thread will read from 4 positions in order
     // to share the weights -- finding the best distance
     // might need some iterations
     auto src0 = tensor;
     auto src1 = src0 + c;
     auto src2 = src1 + c;
     auto src3 = src2 + c;
     for (int i = 0; i < n / 4; i++) {
         vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
         // #pragma unroll?
         for (int j = 0; j < c; j += 8) {       // assumes c is padded to a multiple of 8
             vec8 wj(w + j);                    // 8 weights shared by all 4 positions
             acc0 += wj * vec8(src0 + j);
             acc1 += wj * vec8(src1 + j);
             acc2 += wj * vec8(src2 + j);
             acc3 += wj * vec8(src3 + j);
         }
         vec4 sum = horizontal_reduct(acc0, acc1, acc2, acc3);
         sum.store(dst); dst += 4;
         // advance to the next group of 4 output positions
         src0 += 4 * c; src1 += 4 * c; src2 += 4 * c; src3 += 4 * c;
     }
}

These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile something like using vec4 = float __attribute__((vector_size(16))); to efficient SIMD code.
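A minimal sketch of what such wrapper classes could look like (my illustration, not the answerer's code), built on the GCC/Clang vector_size extension; the names vec4, vec8 and horizontal_reduct mirror the hypothetical helpers used above:

#include <cstring>

typedef float v4sf __attribute__((vector_size(16)));
typedef float v8sf __attribute__((vector_size(32)));

struct vec8 {
    v8sf v;
    explicit vec8(float x) { for (int i = 0; i < 8; ++i) v[i] = x; }   // broadcast
    explicit vec8(float const *p) { std::memcpy(&v, p, sizeof v); }    // unaligned load
    vec8 &operator+=(const vec8 &o) { v += o.v; return *this; }
    friend vec8 operator*(const vec8 &a, const vec8 &b) {
        vec8 r(0.0f); r.v = a.v * b.v; return r;
    }
};

struct vec4 {
    v4sf v;
    void store(float *p) const { std::memcpy(p, &v, sizeof v); }
};

// Sum one 8-wide accumulator down to a single float.
static inline float hsum(const vec8 &x) {
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += x.v[i];
    return s;
}

// One output lane per accumulator, as used by blendHWC above.
static inline vec4 horizontal_reduct(const vec8 &a, const vec8 &b,
                                     const vec8 &c, const vec8 &d) {
    vec4 r;
    r.v[0] = hsum(a); r.v[1] = hsum(b); r.v[2] = hsum(c); r.v[3] = hsum(d);
    return r;
}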

稀香 2025-01-30 11:06:51


As @hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.

int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    result += tensor.slice(i)     * w_id(i)
            + tensor.slice(i + 1) * w_id(i + 1)
            + tensor.slice(i + 2) * w_id(i + 2)
            + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    result += tensor.slice(i) * w_id(i);
    i++;
}
反话 2025-01-30 11:06:51


Without having the actual code, I'm guessing that

+= tensor.slice(i) * w_id(i)

creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function

addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )

which translates to pure loops over the elements:

for (i=....)
  for (j=...)
    lhs[i][j] += slice1[i][j]*w1 + slice2[i][j]*w2 + ...;

It would surprise me if that doesn't buy you an extra factor.
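A concrete sketch of such a function (my illustration, assuming raw pointers to the contiguous, column-major slice data; n is rows*cols, and a tail loop would handle slices % 4 != 0 in the same style):

void addto(float *lhs, const float *s0, float w0, const float *s1, float w1,
           const float *s2, float w2, const float *s3, float w3, int n) {
    // one pass over lhs accumulates four weighted slices at once
    for (int j = 0; j < n; ++j)
        lhs[j] += s0[j] * w0 + s1[j] * w1 + s2[j] * w2 + s3[j] * w3;
}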
