How to optimize a SYCL kernel
I'm studying SYCL at university and I have a question about the performance of some code. In particular, I have this C/C++ code:
I need to translate it into a parallel SYCL kernel, and I did it like this:
    #include <sycl/sycl.hpp>
    #include <vector>
    #include <iostream>
    using namespace sycl;

    constexpr int size = 131072; // 2^17

    int main(int argc, char** argv) {
        // Create a vector with size elements and initialize them to 1
        std::vector<float> dA(size, 1.0f);
        try {
            queue gpuQueue{ gpu_selector{} };
            buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
            gpuQueue.submit([&](handler& cgh) {
                accessor inA{ bufA, cgh };
                // Each work-item adds 2 to one element of the buffer
                cgh.parallel_for(range<1>(size),
                    [=](id<1> i) { inA[i] = inA[i] + 2; }
                );
            });
            gpuQueue.wait_and_throw();
        }
        // Rethrow the original exception object without slicing it
        catch (const std::exception&) { throw; }
    }
So my question is about the value of c. In this case I use the value 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is this way correct and the performance good?
Comments (1)
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel, which is as efficient as it gets, I think! There's the slight complication that you have an implicit conversion from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or convert it at runtime or anything like that; it just lives in the instruction.
Equally, if you had:
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
This shows a float load and store from/to the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c like I demonstrated, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!
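As a rough host-side illustration of the point above, here is a minimal sketch in plain C++ (not SYCL device code) that writes the same update once with the literal 2 and once with a named constant c. The function names add_literal and add_named_constant are made up for this example. It only shows that the int literal is converted to float inside the expression and that both spellings compute the same values. Whether a particular compiler emits identical instructions for both is something to confirm by inspecting its output, as was done above with the LLVM IR.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Adding an int literal to a float converts the int operand to float,
// so the whole expression has type float.
static_assert(std::is_same<decltype(1.0f + 2), float>::value,
              "float + int literal yields float");

// Variant 1: the literal is written directly in the loop body,
// mirroring the kernel body `inA[i] = inA[i] + 2`.
std::vector<float> add_literal(std::vector<float> v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = v[i] + 2;
    }
    return v;
}

// Variant 2: the same value held in a named compile-time constant.
// The name `c` follows the question's wording; it is an assumption here.
std::vector<float> add_named_constant(std::vector<float> v) {
    constexpr int c = 2;
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = v[i] + c;
    }
    return v;
}
```

Calling both functions on the same input, for example a vector of eight elements set to 1.0f, should give element-wise identical results, which is consistent with the answer's observation that the choice between a literal and a constant is a matter of readability rather than speed.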