Very basic SSE

Posted on 2024-12-13 08:31:37


I have a very simple program whose performance I am trying to improve. One way that I know will help is to utilize SSE3 (since the machine I am working on supports it), but I have absolutely no idea how to do this. Here is a code snippet (C++):

int sum1, sum2, sum3, sum4;
for (int i=0; i<length; i+=4) {
  for (int j=0; j<length; j+=4) {
    sum1 = sum1 + input->value[i][j];
    sum2 = sum2 + input->value[i+1][j+1];
    sum3 = sum3 + input->value[i+2][j+3];
    sum4 = sum4 + input->value[i+3][j+4];    
  {
}

I've read a little about this, and understand the idea, but I have absolutely no idea how to implement this. Can somebody help me please? I think that this is fairly simple, particularly for my simple program, but sometimes getting started is the hardest part.

Thanks!


隔岸观火 2024-12-20 08:31:38


Actually, in your case, it is not that simple. As it stands right now, your code is NOT vectorizable. (at least not without significant loop transformations)

The reason for this is that you are changing the index i as well inside the inner loop. This breaks any chance of vectorizing the j iteration, because the memory locations are no longer adjacent and lie in different rows of the matrix (you seem to be running down the matrix diagonally).
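To make the adjacency argument concrete, here is a small sketch that computes which flat offsets each version of the loop body touches. It assumes the matrix is stored row-major in one contiguous buffer, which the question does not actually show; flat_offset, question_offsets and row_offsets are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cstddef>

// Flat offset of element (row, col), assuming a row-major matrix with
// `length` columns stored in one contiguous buffer (an assumption).
constexpr std::size_t flat_offset(std::size_t row, std::size_t col, std::size_t length) {
    return row * length + col;
}

// Offsets read by the question's loop body at (i, j), indices copied as written.
inline std::array<std::size_t, 4> question_offsets(std::size_t i, std::size_t j, std::size_t length) {
    return {{flat_offset(i, j, length), flat_offset(i + 1, j + 1, length),
             flat_offset(i + 2, j + 3, length), flat_offset(i + 3, j + 4, length)}};
}

// Offsets read when all four accesses stay in row i (columns j .. j+3).
inline std::array<std::size_t, 4> row_offsets(std::size_t i, std::size_t j, std::size_t length) {
    return {{flat_offset(i, j, length), flat_offset(i, j + 1, length),
             flat_offset(i, j + 2, length), flat_offset(i, j + 3, length)}};
}
```

With length 8 at (0, 0), the question's body touches offsets 0, 9, 19 and 28, while the single-row version touches 0, 1, 2 and 3. Only the second pattern can be fetched with one 128-bit load.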

However, I get the feeling that you are trying to sum up all the elements in your matrix, and you actually intended your loop to be like this (and you had a number of typos too):

int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (int i=0; i<length; i++) {
  for (int j=0; j<length; j+=4) {
    sum1 = sum1 + input->value[i][j];
    sum2 = sum2 + input->value[i][j+1];
    sum3 = sum3 + input->value[i][j+2];
    sum4 = sum4 + input->value[i][j+3];    
  }
}

int total = sum1 + sum2 + sum3 + sum4;

If this is what you wanted, then it is very vectorizable.
In C/C++ using intrinsics, this can be done as follows using just SSE2:

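#include <emmintrin.h>  // SSE2 intrinsics

__m128i sum = _mm_setzero_si128();  // four 32-bit partial sums, all zero
for (int i=0; i<length; i++) {
  for (int j=0; j<length; j+=4) {
    // Load four adjacent ints; the address must be 16-byte aligned.
    __m128i val = _mm_load_si128((const __m128i*)&input->value[i][j]);
    sum = _mm_add_epi32(sum,val);  // lane-wise 32-bit add
  }
}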
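The loop above leaves four partial sums in the lanes of sum, so one more step is needed to get a single int. A minimal sketch of a complete function follows. It assumes the matrix is a contiguous row-major buffer of int (the question does not show how input->value is declared), and the names horizontal_sum_epi32 and sum_matrix_sse2 are hypothetical. It uses unaligned loads and a scalar tail so that it also works when the data is not 16-byte aligned or length is not a multiple of 4.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Adds the four 32-bit lanes of v and returns the result.
inline int horizontal_sum_epi32(__m128i v) {
    // Swap the 64-bit halves and add: lane 0 = v0+v2, lane 1 = v1+v3.
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);
    // Swap neighbouring lanes and add: lane 0 = v0+v1+v2+v3.
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);  // extract lane 0
}

// Sums a length x length matrix stored row-major in `data` (assumed layout).
int sum_matrix_sse2(const int* data, std::size_t length) {
    __m128i sum = _mm_setzero_si128();
    int tail = 0;
    for (std::size_t i = 0; i < length; ++i) {
        const int* row = data + i * length;
        std::size_t j = 0;
        for (; j + 4 <= length; j += 4) {
            // Unaligned load: no alignment requirement on row + j.
            __m128i val = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row + j));
            sum = _mm_add_epi32(sum, val);
        }
        // Leftover columns when length is not a multiple of 4.
        for (; j < length; ++j) tail += row[j];
    }
    return horizontal_sum_epi32(sum) + tail;
}
```

The shuffle-based reduction runs once, after the loops, so its cost is negligible next to the main loop. Storing the vector to a small array and adding the four ints would work just as well.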

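Note that alignment restrictions apply: _mm_load_si128 requires a 16-byte-aligned address (_mm_loadu_si128 does not), and the loop as written assumes length is a multiple of 4. The four 32-bit lanes of sum still have to be added together to get the final total. A lot more speedup can be gained by further unrolling the loop.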
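As a rough illustration of the unrolling idea, the sketch below processes eight ints per iteration with two independent accumulators, so the two adds do not depend on each other. It treats the matrix as one flat buffer of n ints, which is an assumption about the layout; sum_unrolled_sse2 is a hypothetical name. Whether this actually beats the simpler loop depends on the compiler and CPU, so measure before relying on it.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Sums n ints from a contiguous buffer using two SSE2 accumulators.
int sum_unrolled_sse2(const int* data, std::size_t n) {
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    std::size_t k = 0;
    for (; k + 8 <= n; k += 8) {
        // Two independent unaligned loads and adds per iteration.
        __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + k));
        __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + k + 4));
        acc0 = _mm_add_epi32(acc0, v0);
        acc1 = _mm_add_epi32(acc1, v1);
    }

    // Combine the accumulators, then spill the four lanes and add them.
    __m128i acc = _mm_add_epi32(acc0, acc1);
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for the remaining n % 8 elements.
    for (; k < n; ++k) total += data[k];
    return total;
}
```

For a length x length matrix in a single buffer, this would be called as sum_unrolled_sse2(data, length * length).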
