From ARM NEON to Intel intrinsics: sum of absolute differences of 8x uint8_t

Posted on 2025-02-04 04:09:22

I am trying to convert some code using ARM NEON intrinsics to use Intel intrinsics instead.

I immediately got stuck and am trying to find the appropriate Intel intrinsics to replace the NEON intrinsics. My first hurdle is to translate the following function:

void sad_row_8(uint8_t *row1, uint8_t *row2, int *result)
{
    *result = 0;
    uint8x8_t vec1 = vld1_u8(row1);
    uint8x8_t vec2 = vld1_u8(row2);
    uint8x8_t absvec = vabd_u8(vec1, vec2);
    *result += vaddlv_u8(absvec);
}

In the code above, row1 and row2 are pointers to rows of at least 8 consecutive uint8_t elements. The function computes the sum of absolute differences between two rows of uint8_t elements.

When writing code using NEON intrinsics I used https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon] to find appropriate intrinsics, and I never had much trouble finding what I needed.
In my attempt to find the correct Intel intrinsics to translate the code above, I have attempted using https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512&ig_expand=54,6050&cats=Load .
Here, I have tried to find corresponding intrinsics to the ones I have used in the NEON solution, but without much luck.

What I am looking for is help/advice on how I can better approach this problem, perhaps by pointing out some (possibly?) obvious flaws in my approach.

My processor is an Intel Core i5-11400F, which according to Intel has the instruction set extensions Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512.


Comments (1)

败给现实 2025-02-11 04:09:22


Sum of absolute differences is done a bit differently on Intel.

In NEON programming one uses the traditional per-lane absolute-difference operation (vabd), preferably with widening accumulation, followed by a final horizontal reduction.

On Intel, the intrinsic _mm_sad_epu8 instead performs two absolute-difference + horizontal-reduction operations in parallel, one per 8-byte half of the register:

1 2 3 1 2 3 1 2 | 0 1 0 1 4 2 1 0 | < register A
0 0 0 0 1 1 1 1 | 2 2 2 2 3 3 3 3 | < register B
-----------------------------------
1 2 3 1 1 2 0 1   2 1 2 1 1 1 2 3   < -- Neon vabdq_u8(A,B)
11  (uint64_t)    13   (uint64_t)   < -- Intel _mm_sad_epu8(A,B)

The corresponding Intel routine would be:

void sad_row_8(uint8_t *row1, uint8_t *row2, int *result)
{
    /* Requires immintrin.h; _mm_sad_epu8 is SSE2. */
    __m128i vec1 = _mm_loadu_si64(row1);        /* load 8 bytes into the low half */
    __m128i vec2 = _mm_loadu_si64(row2);
    __m128i absvec = _mm_sad_epu8(vec1, vec2);  /* SAD of the low 8 bytes, in lane 0 */
    *result = _mm_cvtsi128_si32(absvec);        /* extract the sum */
}