How to optimize a SYCL kernel

Posted 2025-02-12 00:42:30

I'm studying SYCL at university and I have a question about the performance of my code.
In particular, I have this C/C++ code:

I need to translate it into a parallel SYCL kernel, and I did it like this:

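#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
  // Create a vector with size elements and initialize them to 1
  std::vector<float> dA(size, 1.0f);
  try {
    // gpu_selector_v is the SYCL 2020 replacement for the deprecated gpu_selector{}
    queue gpuQueue{ gpu_selector_v };
    // The buffer wraps dA; results are copied back when bufA goes out of scope
    buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
    gpuQueue.submit([&](handler& cgh) {
                    accessor inA{ bufA,cgh };
                    // One work-item per element: add the constant 2 in place
                    cgh.parallel_for(range<1>(size),
                                     [=](id<1> i) { inA[i] = inA[i] + 2; }
                    );
    });
    gpuQueue.wait_and_throw();
  }
  // Catch by const reference and report; "throw e;" would copy and slice the exception
  catch (const std::exception& e) {
    std::cerr << "Error: " << e.what() << '\n';
    return 1;
  }
}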

So my question is about the value c. Here I use the literal 2 directly in the kernel; will that affect performance when I run the code? Do I need to create a variable for it, or is this approach correct and efficient as written?

扛刀软妹 2025-02-19 00:42:30

Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel, which I think is as efficient as it gets. There's a slight complication: the expression implicitly converts the int 2 to float. My guess is that you'll end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or convert it at runtime; it just lives in the instruction.

Equally, if you had:

constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc

The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.

I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:

  %5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
  %add.i = fadd float %5, 2.000000e+00
  store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17

This shows a float load & store to/from the same address, with an 'add 2.0' instruction in between. If I modify to use the variable c like I demonstrated, I get the same LLVM IR.

Conclusion: you've already achieved maximum efficiency, and compilers are smart!
