SIMD: how to add the corresponding values of 2 vectors with different element widths (char or uint8_t added to int)

Posted on 2025-01-25 20:39:41


How can I add corresponding values from two SIMD vectors of the same type (__m128i) when the elements themselves occupy a different number of bytes in each vector?

Here's an example:

#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128

int main()
{
    //--------------------------------------------------------------
    int my_int_sequence[16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };

    // A __m128i holds four 32-bit ints, so 16 ints need four vectors
    __m128i my_int_sequence_m128i_1 = _mm_loadu_si128((const __m128i*) &my_int_sequence[0]);
    __m128i my_int_sequence_m128i_2 = _mm_loadu_si128((const __m128i*) &my_int_sequence[4]);
    __m128i my_int_sequence_m128i_3 = _mm_loadu_si128((const __m128i*) &my_int_sequence[8]);
    __m128i my_int_sequence_m128i_4 = _mm_loadu_si128((const __m128i*) &my_int_sequence[12]);
    //--------------------------------------------------------------

    //-----------------------------------------------------------------------
    char my_char_mask[16] = { 1,0,1,1,0,1,0,1,1,1,0,1,0,1,0,1 };

    // All sixteen 8-bit values fit in a single __m128i
    __m128i my_char_mask_m128i = _mm_loadu_si128((const __m128i*) &my_char_mask[0]);
    //-----------------------------------------------------------------------

}

That is, I have 16 int values in the my_int_sequence array. Since all 16 ints don't fit in one __m128i vector (it holds only four 32-bit ints), I load them four at a time into four __m128i vectors.

I also have an array of 16 bytes, which I loaded into the single vector my_char_mask_m128i.

Now I want to add, to each 4-byte value in the my_int_sequence_m128i_x vectors, the corresponding 1-byte value from my_char_mask_m128i.

The obvious problem is that I need to add elements of different widths (32-bit and 8-bit). Is this possible?

Perhaps I need to widen each byte of my_char_mask_m128i into 4 bytes. How do I do that?


反差帅 2025-02-01 20:39:41

Perhaps I need each byte of the vector my_char_mask_m128i - how to transform it into 4 bytes?

You're looking for the SSE4.1 intrinsic _mm_cvtepi8_epi32(), which takes the first 4 (signed) 8-bit integers in the SSE vector and sign-extends them into 32-bit integers. Combine that with some shifting to move the next 4 into place for the next extension, and you get something like:

#include <iostream>
#include <cstdint>
#include <emmintrin.h>
#include <smmintrin.h>

void print_int4(__m128i vec) {
  alignas(16) std::int32_t ints[4];
  _mm_store_si128(reinterpret_cast<__m128i*>(ints), vec);
  std::cout << '[' << ints[0] << ", " << ints[1] << ", " << ints[2] << ", "
            << ints[3] << ']';
}

int main(void) {
  alignas(16) std::int32_t 
    my_int_sequence[16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };
  alignas(16) std::int8_t
    my_char_mask[16] = { 1,0,1,1,0,1,0,1,1,1,0,1,0,1,0,1 };

  __m128i char_mask = _mm_load_si128(reinterpret_cast<__m128i*>(my_char_mask));
  
  // Loop through the 32-bit int array 4 at a time
  for (int n = 0; n < 16; n += 4) {
    // Load the next 4 ints
    __m128i vec =
      _mm_load_si128(reinterpret_cast<__m128i*>(my_int_sequence + n));
    // Convert the next 4 chars to ints
    __m128i chars_to_add = _mm_cvtepi8_epi32(char_mask);
    // Shift out those 4 chars
    char_mask = _mm_srli_si128(char_mask, 4);
    // And add together
    __m128i sum = _mm_add_epi32(vec, chars_to_add);

    print_int4(vec);
    std::cout << " + ";
    print_int4(chars_to_add);
    std::cout << " = ";
    print_int4(sum);
    std::cout << '\n';
  }
}

Example run (note that you usually have to tell your compiler to generate SSE4.1 instructions; with g++ and clang++, use an appropriate -march=XXXX option or -msse4.1):

$ g++ -O -Wall -Wextra -std=gnu++11 -msse4.1 demo.cc
$ ./a.out
[0, 1, 2, 3] + [1, 0, 1, 1] = [1, 1, 3, 4]
[4, 5, 6, 7] + [0, 1, 0, 1] = [4, 6, 6, 8]
[8, 9, 10, 11] + [1, 1, 0, 1] = [9, 10, 10, 12]
[12, 13, 14, 15] + [0, 1, 0, 1] = [12, 14, 14, 16]

Alternative version suggested by Peter Cordes, if your compiler is new enough to have _mm_loadu_si32(). It loads only the 4 bytes needed in each iteration instead of shifting a 16-byte vector:

  // Replaces the loop in main() above (same my_int_sequence and my_char_mask)
  // Loop through the 32-bit int array 4 at a time
  for (int n = 0; n < 16; n += 4) {
    // Load the next 4 ints
    __m128i vec =
      _mm_load_si128(reinterpret_cast<__m128i*>(my_int_sequence + n));
    // Load the next 4 chars into the low 32 bits of the vector
    __m128i char_mask = _mm_loadu_si32(my_char_mask + n);
    // Convert them to ints
    __m128i chars_to_add = _mm_cvtepi8_epi32(char_mask);
    // And add together
    __m128i sum = _mm_add_epi32(vec, chars_to_add);

    // Do more stuff
  }
