Why is this code not efficient?

Posted 2024-12-21 20:13:30


I want to improve the following code, which computes the mean and standard deviation of an 8×8 patch:

void calculateMeanStDev8x8Aux(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
{

    unsigned sum=0;
    unsigned sqsum=0;
    const unsigned char* aux=patch->data + sy*patch->step + sx;
    for (int j=0; j< 8; j++) {
    const unsigned char* p = (const unsigned char*)(j*patch->step + aux); // pointer to the start of row j

        for (int i=0; i<8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f*f;
        }           
    }       

    mean = sum >> 6;
    int r = (sum*sum) >> 6;
    stdev = sqrtf(sqsum - r);

    if (stdev < .1) {
        stdev=0;
    }
}

I also tried to improve the following loop with NEON intrinsics:

 for (int i=0; i<8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f*f;
        }

This is the improved code for that loop:

        int32x4_t vsum= { 0 };
        int32x4_t vsum2= { 0 };

        int32x4_t vsumll = { 0 };
        int32x4_t vsumlh = { 0 };
        int32x4_t vsumll2 = { 0 };
        int32x4_t vsumlh2 = { 0 };

        uint8x8_t  f= vld1_u8(p); // VLD1.8 {d0}, [r0]

        // widen u8 to int16, 8 elements
        int16x8_t val =  (int16x8_t)vmovl_u8(f);

        // widen to int32, 4 elements x 2
        int32x4_t vall = vmovl_s16(vget_low_s16(val));
        int32x4_t valh = vmovl_s16(vget_high_s16(val));

        // update 4 partial sum of products vectors

        vsumll2 = vmlaq_s32(vsumll2, vall, vall);
        vsumlh2 = vmlaq_s32(vsumlh2, valh, valh);

        // sum 4 partial sum of product vectors
        vsum = vaddq_s32(vall, valh);
        vsum2 = vaddq_s32(vsumll2, vsumlh2);

        // do scalar horizontal sum across final vector

        sum += vgetq_lane_s32(vsum, 0);
        sum += vgetq_lane_s32(vsum, 1);
        sum += vgetq_lane_s32(vsum, 2);
        sum += vgetq_lane_s32(vsum, 3);

        sqsum += vgetq_lane_s32(vsum2, 0);
        sqsum += vgetq_lane_s32(vsum2, 1);
        sqsum += vgetq_lane_s32(vsum2, 2);
        sqsum += vgetq_lane_s32(vsum2, 3);

But it is roughly 30 ms slower. Does anyone know why?

All of the code works correctly.


Comments (4)

夏至、离别 2024-12-28 20:13:30


To add to Lundin's answer: yes, on instruction sets like ARM, where you have register-based indexing or some reach with an immediate index, you might benefit from encouraging the compiler to use indexing. Also, ARM for example can increment its pointer register within the load instruction, so *p++ is basically one instruction.

It is always a toss-up between p[i] or p[i++] and *p or *p++; on some instruction sets it is much more obvious which path to take.

Likewise with your index: if you are not otherwise using it, counting down instead of up can save an instruction per loop, maybe more. Some compilers might do this:

inc reg
cmp reg,#7
bne loop_top

If you were counting down though you might save an instruction per loop:

dec reg
bne loop_top

or even one processor I know of

decrement_and_jump_if_not_zero  loop_top

The compilers usually know this and you don't have to encourage them. BUT if you use the p[i] form where the memory read order is important, then the compiler can't, or at least should not, arbitrarily change the order of the reads. So for that case you would want to have the code count down.

So I tried all of these things:

unsigned fun1 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    unsigned f;


    sum = 0;
    sqsum = 0;
    for(i=0; i<8; i++)
    {
        f = *p++;
        sum += f;
        sqsum += f*f;
    }
    //to keep the compiler from optimizing
    //stuff out
    x[0]=sum;
    return(sqsum);
}

unsigned fun2 ( const unsigned char *p, unsigned *x  )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    unsigned f;


    sum = 0;
    sqsum = 0;
    for(i=8;i--;)
    {
        f = *p++;
        sum += f;
        sqsum += f*f;
    }
    //to keep the compiler from optimizing
    //stuff out
    x[0]=sum;
    return(sqsum);
}

unsigned fun3 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;

    sum = 0;
    sqsum = 0;
    for(i=0; i<8; i++)
    {
        sum += (unsigned)p[i];
        sqsum += ((unsigned)p[i])*((unsigned)p[i]);
    }
    //to keep the compiler from optimizing
    //stuff out
    x[0]=sum;
    return(sqsum);
}

unsigned fun4 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;

    sum = 0;
    sqsum = 0;
    for(i=8; i;i--)
    {
        sum += (unsigned)p[i-1];
        sqsum += ((unsigned)p[i-1])*((unsigned)p[i-1]);
    }
    //to keep the compiler from optimizing
    //stuff out
    x[0]=sum;
    return(sqsum);
}

with both gcc and llvm (clang). Of course both unrolled the loop, since the count was a constant. gcc produced the same code for each of the experiments, in some cases with a subtle change in the register mix. I would argue that is a bug, since in at least one of them the reads were not in the order described by the code.

gcc's solution for all four was this, with some read reordering; notice the reads are out of order relative to the source code. If this were run against hardware/logic that relied on the reads being in the order described by the code, you would have a big problem.

00000000 <fun1>:
   0:   e92d05f0    push    {r4, r5, r6, r7, r8, sl}
   4:   e5d06001    ldrb    r6, [r0, #1]
   8:   e00a0696    mul sl, r6, r6
   c:   e4d07001    ldrb    r7, [r0], #1
  10:   e02aa797    mla sl, r7, r7, sl
  14:   e5d05001    ldrb    r5, [r0, #1]
  18:   e02aa595    mla sl, r5, r5, sl
  1c:   e5d04002    ldrb    r4, [r0, #2]
  20:   e02aa494    mla sl, r4, r4, sl
  24:   e5d0c003    ldrb    ip, [r0, #3]
  28:   e02aac9c    mla sl, ip, ip, sl
  2c:   e5d02004    ldrb    r2, [r0, #4]
  30:   e02aa292    mla sl, r2, r2, sl
  34:   e5d03005    ldrb    r3, [r0, #5]
  38:   e02aa393    mla sl, r3, r3, sl
  3c:   e0876006    add r6, r7, r6
  40:   e0865005    add r5, r6, r5
  44:   e0854004    add r4, r5, r4
  48:   e5d00006    ldrb    r0, [r0, #6]
  4c:   e084c00c    add ip, r4, ip
  50:   e08c2002    add r2, ip, r2
  54:   e082c003    add ip, r2, r3
  58:   e023a090    mla r3, r0, r0, sl
  5c:   e080200c    add r2, r0, ip
  60:   e5812000    str r2, [r1]
  64:   e1a00003    mov r0, r3
  68:   e8bd05f0    pop {r4, r5, r6, r7, r8, sl}
  6c:   e12fff1e    bx  lr

The load indexing and subtle register mixing were the only differences between the gcc functions; all of the operations were the same, in the same order.

llvm/clang:

00000000 <fun1>:
   0:   e92d41f0    push    {r4, r5, r6, r7, r8, lr}
   4:   e5d0e000    ldrb    lr, [r0]
   8:   e5d0c001    ldrb    ip, [r0, #1]
   c:   e5d03002    ldrb    r3, [r0, #2]
  10:   e5d08003    ldrb    r8, [r0, #3]
  14:   e5d04004    ldrb    r4, [r0, #4]
  18:   e5d05005    ldrb    r5, [r0, #5]
  1c:   e5d06006    ldrb    r6, [r0, #6]
  20:   e5d07007    ldrb    r7, [r0, #7]
  24:   e08c200e    add r2, ip, lr
  28:   e0832002    add r2, r3, r2
  2c:   e0882002    add r2, r8, r2
  30:   e0842002    add r2, r4, r2
  34:   e0852002    add r2, r5, r2
  38:   e0862002    add r2, r6, r2
  3c:   e0870002    add r0, r7, r2
  40:   e5810000    str r0, [r1]
  44:   e0010e9e    mul r1, lr, lr
  48:   e0201c9c    mla r0, ip, ip, r1
  4c:   e0210393    mla r1, r3, r3, r0
  50:   e0201898    mla r0, r8, r8, r1
  54:   e0210494    mla r1, r4, r4, r0
  58:   e0201595    mla r0, r5, r5, r1
  5c:   e0210696    mla r1, r6, r6, r0
  60:   e0201797    mla r0, r7, r7, r1
  64:   e8bd41f0    pop {r4, r5, r6, r7, r8, lr}
  68:   e1a0f00e    mov pc, lr

Much easier to read and follow, perhaps with a cache in mind, getting the reads all done in one shot. llvm also got the reads out of order in at least one case:

00000144 <fun4>:
 144:   e92d40f0    push    {r4, r5, r6, r7, lr}
 148:   e5d0c007    ldrb    ip, [r0, #7]
 14c:   e5d03006    ldrb    r3, [r0, #6]
 150:   e5d02005    ldrb    r2, [r0, #5]
 154:   e5d05004    ldrb    r5, [r0, #4]
 158:   e5d0e000    ldrb    lr, [r0]
 15c:   e5d04001    ldrb    r4, [r0, #1]
 160:   e5d06002    ldrb    r6, [r0, #2]
 164:   e5d00003    ldrb    r0, [r0, #3]

Yes, for averaging some values from RAM the order is not an issue; moving on.

So the compilers chose the unrolled path and did not care about the micro-optimizations. Because of the size of the loop, both chose to burn a bunch of registers, each holding one loaded value, then performing the adds and multiplies from those temporaries. If we increased the size of the loop a little, I would expect to see the sum and sqsum accumulations inside the unrolled loop as the compiler runs out of registers, or a threshold would be reached where they choose not to unroll the loop at all.

If I pass the length in, and replace the 8s in the code above with that passed-in length, the compiler is forced to make a loop out of this. You can then see the optimizations; instructions like this are used:

  a4:   e4d35001    ldrb    r5, [r3], #1

And, this being ARM, they modify the loop register in one place and branch-if-not-equal a number of instructions later... because they can.
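
For reference, here is a sketch of that length-parameter experiment in the same style as fun1 above (the name fun5 and the exact shape are my guesses at what was tested, not the answer's original code):

```c
unsigned fun5 ( const unsigned char *p, unsigned *x, unsigned len )
{
    unsigned sum;
    unsigned sqsum;
    unsigned f;

    sum = 0;
    sqsum = 0;
    while(len--)
    {
        f = *p++;
        sum += f;
        sqsum += f*f;
    }
    //to keep the compiler from optimizing
    //stuff out
    x[0]=sum;
    return(sqsum);
}
```

With the trip count unknown at compile time, the compiler has no choice but to emit a real loop with a post-incremented load.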

Granted, this is a math function, but using float is painful. Using multiplies is painful and divides are much worse; fortunately a shift was used here. And fortunately this was unsigned, so a plain shift works (the compiler would/should have known to use an arithmetic shift with a fix-up, if available, had you divided a signed number).
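
To illustrate that unsigned-vs-signed point (my example, not from the answer): for unsigned values a divide by 64 is exactly a logical shift, while for signed values a bare arithmetic shift rounds toward minus infinity whereas C division truncates toward zero, so the compiler must emit extra correction instructions:

```c
/* unsigned: s / 64 and s >> 6 are the same operation */
unsigned udiv64(unsigned s) { return s >> 6; }

/* signed: s / 64 truncates toward zero; a plain arithmetic shift
   would give the wrong answer for negative s (e.g. -130 >> 6 is
   typically -3, but -130 / 64 must be -2), so the compiler adds
   a fix-up before shifting */
int sdiv64(int s) { return s / 64; }
```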

So basically: focus on micro-optimizations of the inner loop, since it gets run many times; change things so they become shifts and adds, if possible; or arrange the data so you can hoist work out of the loop (if possible, and don't waste copy loops elsewhere to do this). For example, for this line:

const unsigned char* p = (const unsigned char*)(j*patch->step + aux );

you could get some speed. I didn't try it, but because it is a loop within a loop, the compiler probably won't unroll that outer loop...

Long story short: against a dumber compiler you might get some gains depending on the instruction set, but this code is not really bad, so the compiler can optimize it about as well as you can.

半边脸i 2024-12-28 20:13:30


First of all, you will probably get very good, detailed answers on stuff like this if you post at Code review instead.

Some comments regarding efficiency and suspicious variable types:

unsigned f = *p++; You will probably be better off if you access p through array indexing and then use p[i] to access the data. This is highly dependent on the compiler, cache memory optimizations etc (some ARM guru can give better advice than me in this matter).

Btw, the whole const char to int conversion looks highly suspicious. I take it those chars are to be regarded as 8-bit unsigned integers? Standard C uint8_t is likely a better type for this; char has various implementation-defined signedness issues that you want to avoid.

Also, why are you wildly mixing unsigned and int? You are asking for implicit integer balancing bugs.

stdev < .1: just a minor thing, but change this to .1f, or you force an implicit promotion of your float to double, since .1 is a double literal.

月牙弯弯 2024-12-28 20:13:30


As your data is being read in groups of 8 bytes, then depending on your hardware bus and the alignment of the array itself, you can probably get some gains by reading the inner loop's data via a single long long read, then either manually splitting the number into separate values, or using ARM intrinsics to do the adds in parallel with some inline asm, using the add8 instruction (adds 4 numbers together at a time in one register), or doing a touch of shifting and using add16 to allow the values to overflow into 16 bits' worth of space. There is also a dual signed multiply-and-accumulate instruction, which means your first accumulation loop is nearly perfectly supported by ARM with just a little help. Also, if the incoming data could be massaged into 16-bit values, that could also speed this up.

As to why the NEON is slower, my guess is that the overhead of setting up the vectors, along with the added data you are pushing around with larger types, is killing any performance it might gain on such a small set of data. The original code is very ARM-friendly to begin with, which means the setup overhead is probably what kills you. When in doubt, look at the assembly output. That will tell you what's truly going on. Perhaps the compiler is pushing and popping data all over the place when trying to use the intrinsics - it wouldn't be the first time I've seen this sort of behavior.
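
For what it's worth, one way to cut that setup overhead (my untested sketch, not from the answers) is to keep widened accumulators live across all 8 rows and do the horizontal reduction only once at the end, using NEON's widening and pairwise-accumulate operations instead of per-iteration lane extraction:

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sum and sum-of-squares of an 8x8 block of u8 pixels.
   Accumulators stay in registers for the whole patch;
   only two horizontal reductions happen at the end. */
void sums8x8_neon(const uint8_t *p, size_t step,
                  unsigned *sum, unsigned *sqsum)
{
    uint16x8_t vsum16 = vdupq_n_u16(0);  /* per-lane sums: max 8*255 fits u16 */
    uint32x4_t vsq32  = vdupq_n_u32(0);  /* pairwise-accumulated squares      */

    for (int j = 0; j < 8; j++, p += step) {
        uint8x8_t row = vld1_u8(p);
        vsum16 = vaddw_u8(vsum16, row);     /* widen u8 and accumulate        */
        uint16x8_t sq = vmull_u8(row, row); /* 255*255 fits in u16            */
        vsq32 = vpadalq_u16(vsq32, sq);     /* pairwise add-accumulate to u32 */
    }

    /* horizontal reductions, once per patch */
    uint32x4_t s4 = vpaddlq_u16(vsum16);
    uint32x2_t s2 = vadd_u32(vget_low_u32(s4), vget_high_u32(s4));
    s2 = vpadd_u32(s2, s2);
    *sum = vget_lane_u32(s2, 0);

    uint32x2_t q2 = vadd_u32(vget_low_u32(vsq32), vget_high_u32(vsq32));
    q2 = vpadd_u32(q2, q2);
    *sqsum = vget_lane_u32(q2, 0);
}
```

Even so, on a patch this small the scalar code may still win, for exactly the reasons given above.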

终难愈 2024-12-28 20:13:30


Thanks to Lundin, dwelch and Michel.
I made the following improvement and it seems the best for my code.
I'm trying to decrease the number of cycles by improving the cache access, since the cache is now accessed only once per row.

int step=patch->step;
 for (int j=0; j< 8; j++) {
        p = (uint8_t*)(j*step + aux); // pointer to the start of row j

        i=8;
        do {
            f=p[i-1]; // was p[i], which read one past the row and never read p[0]
            sum += f;
            sqsum += f*f;

        } while(--i);

}