C++: How do I write code that the compiler can easily optimize for SIMD?

Posted on 2024-09-29 00:30:29 · 643 words · 12 views · 0 comments

I'm working in Visual Studio 2008, and in the project settings I see the option "Enable Enhanced Instruction Set", which I can set to None, SSE or SSE2.

Does this mean the compiler will try to batch instructions together in order to make use of SIMD instructions?

Are there any rules one can follow when writing code so that the compiler can generate efficient assembly using these extensions?

For example, I'm currently working on a raytracer. A shader takes some input and calculates an output color from it, like this:

PixelData data = RayTracer::gatherPixelData(pixel.x, pixel.y);
Color col = shadePixel(data);

Would it, for example, be beneficial to write the shader code so that it shades 4 different pixels in one call? Something like this:

PixelData data1 = RayTracer::gatherPixelData(pixel1.x, pixel1.y);
...
shadePixels(data1, data2, data3, data4, &col1out, &col2out, &col3out, &col4out);

That is, processing multiple data units at once. Would this help the compiler use SSE instructions?

Thanks!

Comments (3)

过去的过去 2024-10-06 00:30:29

> I'm working in Visual Studio 2008 and in the project settings I see the option for "activate Extended Instruction set" which I can set to None, SSE or SSE2
>
> So the compiler will try to batch instructions together in order to make use of SIMD instructions?

No, the compiler will not use vector instructions on its own. It will use scalar SSE instructions instead of x87 ones.

What you describe is called "automatic vectorization". The Microsoft compiler in Visual Studio 2008 does not do this (an auto-vectorizer only arrived with Visual Studio 2012); the Intel compiler does.

With the Microsoft compiler, you can use intrinsics to write SSE code by hand.
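
As a rough illustration of what hand-written intrinsics could look like for the "shade 4 pixels at once" idea from the question, here is a minimal sketch. The `NormalBatch4` struct and the `diffuse4` function are hypothetical names made up for this example, not part of the asker's raytracer; only the `_mm_*` calls from `<xmmintrin.h>` are real SSE intrinsics. It assumes the data for 4 pixels is stored component by component (all x values together, and so on).

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128 and the _mm_*_ps functions

// Hypothetical batch of 4 pixels in structure-of-arrays layout:
// (nx[i], ny[i], nz[i]) is the surface normal of pixel i.
struct NormalBatch4 {
    float nx[4];
    float ny[4];
    float nz[4];
};

// Computes the diffuse term max(0, dot(N, L)) for 4 pixels at once,
// where L = (lx, ly, lz) is a light direction shared by all 4 pixels.
void diffuse4(const NormalBatch4& n, float lx, float ly, float lz, float out[4]) {
    // Unaligned loads: each register holds one component of 4 normals.
    __m128 nx = _mm_loadu_ps(n.nx);
    __m128 ny = _mm_loadu_ps(n.ny);
    __m128 nz = _mm_loadu_ps(n.nz);

    // dot = nx*lx + ny*ly + nz*lz, evaluated for all 4 pixels in parallel.
    __m128 dot = _mm_mul_ps(nx, _mm_set1_ps(lx));
    dot = _mm_add_ps(dot, _mm_mul_ps(ny, _mm_set1_ps(ly)));
    dot = _mm_add_ps(dot, _mm_mul_ps(nz, _mm_set1_ps(lz)));

    // Clamp negative values to zero, then write the 4 results back.
    dot = _mm_max_ps(dot, _mm_setzero_ps());
    _mm_storeu_ps(out, dot);
}
```

The layout matters as much as the intrinsics: with one array per component, each load fills a whole register with useful data, which is why batching 4 pixels tends to map well onto SSE.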

星光不落少年眉 2024-10-06 00:30:29

Three observations.

  1. The best speedups come not from low-level optimizations but from good algorithms. So make sure you get that part right first. Often this just means using the right libraries for your specific domain.

  2. Once you have the algorithms right, it is time to measure. Often there is an 80/20 rule at work: 20% of your code takes 80% of the execution time. To locate that part you need a good profiler. Intel VTune can give you a sampling profile for every function and nice reports that pinpoint the performance killers. A free alternative is AMD CodeAnalyst, if you have an AMD CPU.

  3. Compiler auto-vectorization is not a silver bullet. Although the compiler will try really hard (especially Intel C++), you will often need to help it by rewriting the algorithms in vector form. You can often get much better results by handcrafting small portions of the bottleneck code to use SIMD instructions. You can do that in C code using intrinsics (see the link in VJo's answer) or with inline assembly.

Of course, steps 2 and 3 form an iterative process. If you are really serious about this, there are some good books on the subject by Intel folks, such as The Software Optimization Cookbook, as well as the processor reference manuals.
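
To make "rewriting the algorithms in vector form" from point 3 a bit more concrete, here is a small sketch of a loop laid out so that an auto-vectorizing compiler has a fair chance. The `ColorBufferSoA` struct and `applyIntensity` function are hypothetical names invented for this example. The assumptions are that each color channel lives in its own contiguous array and that every iteration is independent, with no branches or function calls in the loop body. Whether a particular compiler actually vectorizes it still has to be checked in the generated assembly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays pixel buffer: each channel is stored
// contiguously, instead of one {r, g, b} struct per pixel.
struct ColorBufferSoA {
    std::vector<float> r;
    std::vector<float> g;
    std::vector<float> b;
};

// Multiplies every pixel's color by that pixel's intensity.
// Iteration i only touches element i of each array, so consecutive
// iterations can be executed together in SIMD lanes.
void applyIntensity(ColorBufferSoA& c, const std::vector<float>& intensity) {
    const std::size_t n = intensity.size();
    for (std::size_t i = 0; i < n; ++i) {
        c.r[i] *= intensity[i];
        c.g[i] *= intensity[i];
        c.b[i] *= intensity[i];
    }
}
```

The pixel count does not need to be a multiple of 4 here; a vectorizing compiler normally emits a scalar remainder loop for the leftover elements.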

埋情葬爱 2024-10-06 00:30:29

The compiler is not almighty, and it has some limitations. If it can (and if the right flags are passed to it), it will use SSE instructions. The only way to see what it actually did is to examine the assembly code it generated.
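
One possible way to do that check, sketched on a tiny test function: the file name `saxpy.cpp` and the function `saxpy` are made up for this example, while the compiler switches in the comments are the usual ones for producing an assembly listing. What the listing contains depends on the compiler version and optimization level, so treat this as a starting point rather than a guaranteed result.

```cpp
// Hypothetical file name: saxpy.cpp
//
// Possible ways to get an assembly listing for this file:
//   MSVC (32-bit): cl /c /O2 /arch:SSE2 /FA saxpy.cpp   -> writes saxpy.asm
//   GCC or Clang:  g++ -O2 -msse2 -S saxpy.cpp          -> writes saxpy.s
//
// What to look for in the loop body:
//   mulss / addss  -> scalar SSE, one float per instruction
//   mulps / addps  -> packed SSE, four floats per instruction (vectorized)

// Computes y[i] = a * x[i] + y[i] for i in [0, n).
void saxpy(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```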

Another option is to use the SSE/SSE2 intrinsics from C. For Windows, you can find them here:

http://msdn.microsoft.com/en-us/library/y0dh78ez%28VS.80%29.aspx
