Why does deleting the initialization cause an AVX2 FMA performance drop?

Posted on 2025-02-02 21:57:37

The code is on Compiler Explorer: https://godbolt.org/z/d6bx9vh1s. Feel free to browse it, edit it, and check the speed.

I wrote a small program to measure the peak throughput of AVX2 FMA. However, I found that deleting the vxorps section, which zeroes the registers before the loop, causes a huge performance drop (from 100+ GFLOPS down to about 1 GFLOPS).

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  const int iters = 1 << 20;
  int counter = iters;  // decremented inside the asm, so it is a read-write operand

  auto t1 = std::chrono::steady_clock::now();
  asm volatile(R"(
# zero the ten accumulators
vxorps %%ymm0, %%ymm0, %%ymm0
vxorps %%ymm1, %%ymm1, %%ymm1
vxorps %%ymm2, %%ymm2, %%ymm2
vxorps %%ymm3, %%ymm3, %%ymm3
vxorps %%ymm4, %%ymm4, %%ymm4
vxorps %%ymm5, %%ymm5, %%ymm5
vxorps %%ymm6, %%ymm6, %%ymm6
vxorps %%ymm7, %%ymm7, %%ymm7
vxorps %%ymm8, %%ymm8, %%ymm8
vxorps %%ymm9, %%ymm9, %%ymm9

1:

vfmadd231ps %%ymm0, %%ymm0, %%ymm0
vfmadd231ps %%ymm1, %%ymm1, %%ymm1
vfmadd231ps %%ymm2, %%ymm2, %%ymm2
vfmadd231ps %%ymm3, %%ymm3, %%ymm3
vfmadd231ps %%ymm4, %%ymm4, %%ymm4
vfmadd231ps %%ymm5, %%ymm5, %%ymm5
vfmadd231ps %%ymm6, %%ymm6, %%ymm6
vfmadd231ps %%ymm7, %%ymm7, %%ymm7
vfmadd231ps %%ymm8, %%ymm8, %%ymm8
vfmadd231ps %%ymm9, %%ymm9, %%ymm9

addl $-1, %0
jne 1b
  )"
               : "+r"(counter)  // the loop modifies the counter
               :
               // flags and the vector registers the loop overwrites
               : "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6",
                 "xmm7", "xmm8", "xmm9");
  auto t2 = std::chrono::steady_clock::now();

  // 10 FMA instructions x 8 float lanes x 2 FLOPs (multiply + add)
  std::int64_t flops_per_iter = 10 * 8 * 2;
  std::int64_t flops = flops_per_iter * iters;
  double seconds = std::chrono::duration<double>(t2 - t1).count();
  double flops_per_second = flops / seconds;
  std::printf("%.4f GFLOPS\n", flops_per_second / 1e9);

  return 0;
}

The result should be a little over 100 GFLOPS. But if you delete the vxorps part:

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  const int iters = 1 << 20;
  int counter = iters;  // decremented inside the asm, so it is a read-write operand

  auto t1 = std::chrono::steady_clock::now();
  asm volatile(R"(
# ymm0..ymm9 are intentionally left uninitialized in this variant
1:

vfmadd231ps %%ymm0, %%ymm0, %%ymm0
vfmadd231ps %%ymm1, %%ymm1, %%ymm1
vfmadd231ps %%ymm2, %%ymm2, %%ymm2
vfmadd231ps %%ymm3, %%ymm3, %%ymm3
vfmadd231ps %%ymm4, %%ymm4, %%ymm4
vfmadd231ps %%ymm5, %%ymm5, %%ymm5
vfmadd231ps %%ymm6, %%ymm6, %%ymm6
vfmadd231ps %%ymm7, %%ymm7, %%ymm7
vfmadd231ps %%ymm8, %%ymm8, %%ymm8
vfmadd231ps %%ymm9, %%ymm9, %%ymm9

addl $-1, %0
jne 1b
  )"
               : "+r"(counter)  // the loop modifies the counter
               :
               // flags and the vector registers the loop overwrites
               : "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6",
                 "xmm7", "xmm8", "xmm9");
  auto t2 = std::chrono::steady_clock::now();

  // 10 FMA instructions x 8 float lanes x 2 FLOPs (multiply + add)
  std::int64_t flops_per_iter = 10 * 8 * 2;
  std::int64_t flops = flops_per_iter * iters;
  double seconds = std::chrono::duration<double>(t2 - t1).count();
  double flops_per_second = flops / seconds;
  std::printf("%.4f GFLOPS\n", flops_per_second / 1e9);

  return 0;
}

The performance drops to about 1 GFLOPS.
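To make the two figures concrete, here is a minimal sketch of the arithmetic behind the benchmark's FLOP count and the wall-clock time each reported rate implies. The helper names (`flops_per_iter`, `total_flops`, `seconds_at`) are made up for this illustration and do not appear in the program above. The only inputs are the numbers already in the post: 10 FMA instructions per iteration, 8 float lanes per ymm register, and 1 << 20 iterations.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs in one loop iteration: every FMA counts as 2 FLOPs
// (one multiply, one add) in each vector lane.
constexpr std::int64_t flops_per_iter(int fma_instructions, int lanes) {
    return static_cast<std::int64_t>(fma_instructions) * lanes * 2;
}

// Total FLOPs executed by the whole benchmark loop.
constexpr std::int64_t total_flops(std::int64_t per_iter,
                                   std::int64_t iterations) {
    return per_iter * iterations;
}

// Wall-clock seconds implied by a reported GFLOPS figure.
inline double seconds_at(std::int64_t flops, double gflops) {
    assert(gflops > 0.0);
    return static_cast<double>(flops) / (gflops * 1e9);
}
```

With the post's numbers, `flops_per_iter(10, 8)` is 160 and `total_flops(160, 1 << 20)` is 167,772,160. At 100 GFLOPS the loop therefore runs for roughly 1.7 ms, and at 1 GFLOPS for roughly 168 ms, so the slow variant spends about a hundred times longer on exactly the same instruction stream.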

This is so strange.
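One hypothesis that may be worth testing (this is an assumption, not something the post establishes): without the vxorps instructions the registers hold whatever earlier code left in them, and subnormal (denormal) operands are known to be slow on many x86 CPUs. The scalar sketch below models a single lane of `vfmadd231ps ymmN, ymmN, ymmN`, which computes x = x * x + x, and reports which floating-point class the value settles into. The helpers `from_bits` and `classify_after` are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit pattern as a float, the way leftover integer
// data in a vector register would be read by a float instruction.
inline float from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Models one lane of the benchmark loop: x = x * x + x, repeated.
// Returns the std::fpclassify category of x after `steps` updates.
inline int classify_after(float x, int steps) {
    assert(steps >= 0);
    for (int i = 0; i < steps; ++i) {
        x = std::fma(x, x, x);  // fused, single rounding, like the FMA unit
    }
    return std::fpclassify(x);
}
```

Under this model a zeroed lane stays `FP_ZERO`, a lane holding a small integer bit pattern such as `from_bits(1)` stays `FP_SUBNORMAL` on every iteration (the square underflows, so the value never changes), and a lane starting at `1.0f` overflows to `FP_INFINITE` within a few steps. Whether subnormal lanes actually explain the slowdown on a given machine would still need to be checked there, for example by rerunning the loop with flush-to-zero and denormals-are-zero enabled.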
