Taking advantage of SSE and other CPU extensions

Posted on 2024-08-14 11:08:22

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes a considerable time to process these.

I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so that once the data has been read into the cache there shouldn't be any cache misses to stall it. However, I'm not sure how to go about this.

  • Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.

  • I still need to support some CPUs that have either no SSE support or only limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?

  • What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.

Comments (5)

や莫失莫忘 2024-08-21 11:08:22

For your second point there are several solutions as long as you can separate out the differences into different functions:

  • plain old C function pointers
  • dynamic linking (which generally relies on C function pointers)
  • if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.

Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher-level functionality, or you may lose whatever gains you get from the optimized instructions to the call overhead. In other words, don't abstract the individual SSE operations; abstract the work you're doing.

Here's an example using function pointers:

typedef int (*scale_func_ptr)( int scalar, int* pData, int count);


int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE so it'll work on older CPUs

    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE

    return 0;
}


// at initialization

scale_func_ptr scale_func = non_sse_scale;

if (useSSE) {
    scale_func = sse_scale;
}


// now, when you want to do the work:

scale_func( 12, theData_ptr, 512);  // this calls the routine that is tailored to SSE
                                    // if the CPU supports it, otherwise it calls the
                                    // non-SSE version of the function

葬﹪忆之殇 2024-08-21 11:08:22

Good reading on the subject: Stop the instruction set war

Short overview: Sorry, it is not possible to solve your problem in a way that is both simple and fully compatible across vendors (Intel vs. AMD).

明月夜 2024-08-21 11:08:22

The SSE intrinsics work with Visual C++, GCC and the Intel compiler. There is no problem using them these days.

Note that you should always keep a version of your code that does not use SSE and constantly check it against your SSE implementation.

This helps not only with debugging; it is also useful if you want to support CPUs or architectures that don't support the SSE versions you require.
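As a rough illustration of that advice, here is a minimal sketch that keeps a plain C routine next to an SSE one and checks one against the other. The names scale_scalar, scale_sse and count_mismatches are made up for this example, and the float-scaling job is only assumed as a stand-in for the real work; the intrinsics themselves (_mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps) come from <xmmintrin.h>, which Visual C++, GCC, Clang and the Intel compiler all provide.

```c
#include <assert.h>
#include <stddef.h>

/* SSE is assumed to be usable at compile time when one of these macros is set. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define BUILD_HAS_SSE 1
#else
#define BUILD_HAS_SSE 0
#endif

/* Reference version: plain C, runs on any CPU. */
static void scale_scalar(float factor, float *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        data[i] *= factor;
}

/* SSE version: four floats per iteration, scalar loop for the leftover tail. */
static void scale_sse(float factor, float *data, size_t count)
{
#if BUILD_HAS_SSE
    const __m128 f = _mm_set1_ps(factor);
    size_t i = 0;
    assert(data != NULL || count == 0);
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);          /* unaligned load */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));  /* multiply and store */
    }
    for (; i < count; ++i)
        data[i] *= factor;
#else
    /* Built without SSE: fall back to the reference version. */
    scale_scalar(factor, data, count);
#endif
}

/* Runs both versions on the same 10 inputs and counts differing elements. */
static int count_mismatches(float factor)
{
    float a[10], b[10];
    int mismatches = 0;
    for (int i = 0; i < 10; ++i)
        a[i] = b[i] = (float)i + 0.5f;
    scale_scalar(factor, a, 10);
    scale_sse(factor, b, 10);
    for (int i = 0; i < 10; ++i)
        if (a[i] != b[i])
            ++mismatches;
    return mismatches;
}
```

Ten elements are used so that the check covers two full 4-wide vectors plus a 2-element tail. An exact comparison is reasonable here because each element sees a single multiplication; longer computations would probably need a tolerance instead.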

梅窗月明清似水 2024-08-21 11:08:22

In answer to your comment:

> So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(sse2Supported){...}else{...}" type switch?

Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.

However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.

So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link them with the rest of the app, which is compiled without SSE support.
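The run-time check behind such a switch could look roughly like the sketch below. It assumes GCC or Clang (via __builtin_cpu_supports) or Visual C++ (via __cpuid from <intrin.h>, where CPUID leaf 1 reports SSE2 in EDX bit 26). The names cpu_has_sse2, scale_basic, scale_sse2 and select_scale are hypothetical, and scale_sse2 is only a placeholder body standing in for a routine that would really live in the separately compiled, SSE-enabled file.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
/* Visual C++: CPUID leaf 1, SSE2 is EDX bit 26. */
static int cpu_has_sse2(void)
{
    int regs[4];
    __cpuid(regs, 1);
    return (regs[3] >> 26) & 1;
}
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* GCC and Clang on x86: built-in CPU feature test. */
static int cpu_has_sse2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
}
#else
/* Unknown compiler or non-x86 target: report no SSE2, use the fallback. */
static int cpu_has_sse2(void)
{
    return 0;
}
#endif

typedef void (*scale_fn)(float factor, float *data, size_t count);

/* Compiled without SSE, safe on every CPU. */
static void scale_basic(float factor, float *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        data[i] *= factor;
}

/* Placeholder: in a real build this would be defined in the file that is
   compiled with SSE2 enabled and would use SSE2 instructions. */
static void scale_sse2(float factor, float *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        data[i] *= factor;
}

/* Pick the implementation once, at start-up, and keep the pointer. */
static scale_fn select_scale(int has_sse2)
{
    return has_sse2 ? scale_sse2 : scale_basic;
}
```

At start-up one would call something like `scale_fn scale = select_scale(cpu_has_sse2());` and use `scale(...)` from then on. The detection function itself has to sit in the part of the program compiled without SSE, otherwise the check could crash before it gets a chance to answer.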

苏佲洛 2024-08-21 11:08:22

Rather than hand-coding an alternative SSE implementation of your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral, portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL compiles your code at runtime for whichever device it executes on, GPU or CPU. So basically you get the SSE code written for you.

> There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes a considerable time to process these.

Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.

> Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.

Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same way on Windows, Linux and Mac (specifically with Visual C++ and GNU g++).

> I still need to support some CPUs that have either no SSE support or only limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?

You could do that (e.g. using dlopen()), but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via a function pointer, or (in C++) to use different implementation classes, depending on the CPU detected.

With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.

> What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.

Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.

This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known good output data, and run the same data through both versions of the processing function. You may need a precision criterion for passing (e.g. the difference between the result and the correct answer must be below an epsilon such as 1e-6). This will greatly aid debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.
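A tolerance check of that kind might be sketched as follows. Everything here is illustrative: process_fn, first_mismatch and compare_implementations are invented names, and third_by_division / third_by_reciprocal are two deliberately different scalar routines standing in for the scalar and SSE versions, chosen because they can differ in the last bit of the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*process_fn)(const float *in, float *out, size_t count);

/* Index of the first element whose absolute difference exceeds epsilon,
   or -1 if every element is within tolerance. A NaN counts as a mismatch. */
static long first_mismatch(const float *expected, const float *actual,
                           size_t count, float epsilon)
{
    assert(expected != NULL && actual != NULL);
    for (size_t i = 0; i < count; ++i) {
        float diff = expected[i] - actual[i];
        if (diff < 0.0f)
            diff = -diff;
        if (!(diff <= epsilon))
            return (long)i;
    }
    return -1;
}

/* Runs both implementations on the same input and compares the outputs.
   Returns -1 on success, the first bad index on mismatch, and -2 if the
   scratch buffers could not be allocated. Assumes count > 0. */
static long compare_implementations(process_fn reference, process_fn candidate,
                                    const float *in, size_t count, float epsilon)
{
    float *expected = malloc(count * sizeof *expected);
    float *actual = malloc(count * sizeof *actual);
    long result = -2;
    assert(count > 0);
    if (expected != NULL && actual != NULL) {
        reference(in, expected, count);
        candidate(in, actual, count);
        result = first_mismatch(expected, actual, count, epsilon);
    }
    free(expected);
    free(actual);
    return result;
}

/* Stand-in for the known good version. */
static void third_by_division(const float *in, float *out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] / 3.0f;
}

/* Stand-in for the optimised version: same maths, different rounding. */
static void third_by_reciprocal(const float *in, float *out, size_t count)
{
    const float r = 1.0f / 3.0f;
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] * r;
}
```

An absolute epsilon such as 1e-6 only makes sense when the results are of roughly unit magnitude; for data with a wide range a relative tolerance is probably the better choice. Returning the failing index rather than a plain pass/fail flag makes it easier to see which input triggered the difference.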
