What's an example of a simple C function that can be implemented faster in inline assembly?

Posted 2024-07-27 15:31:20


I'm having a hard time beating my compiler using inline assembly.

What's a good, non-contrived example of a function that the compiler has a hard time making really, really fast, but that's relatively simple to implement with inline assembly?

Comments (7)

江南烟雨〆相思醉 2024-08-03 15:31:20


If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compiler's autovectorization (if it even has autovectorization!).

Here's a very basic SSE (one of x86's SIMD instruction sets) tutorial. It's for Visual C++ inline assembly.

Edit: Here's a small pair of functions if you want to try it for yourself. They compute an n-length dot product. One uses SSE2 instructions inline (GCC inline syntax); the other is very basic C.

It's very, very simple, and I'd be surprised if a good compiler couldn't vectorize the plain C loop; but if it doesn't, you should see a speedup from the SSE2 version. The SSE2 version could probably be faster if I used more registers, but I don't want to stretch my very weak SSE skills :).

float dot_asm(float *a, float *b, int n)
{
  float ans = 0;
  int i;
  // Assumes n is a multiple of 8; arrays with n % 8 != 0 are not handled.
  while (n > 0) {
    float tmp[4] __attribute__ ((aligned(16)));

     __asm__ __volatile__(
            "xorps      %%xmm0, %%xmm0\n\t"
            "movups     (%0), %%xmm1\n\t"
            "movups     16(%0), %%xmm2\n\t"
            "movups     (%1), %%xmm3\n\t"
            "movups     16(%1), %%xmm4\n\t"
            "add        $32,%0\n\t"
            "add        $32,%1\n\t"
            "mulps      %%xmm3, %%xmm1\n\t"
            "mulps      %%xmm4, %%xmm2\n\t"
            "addps      %%xmm2, %%xmm1\n\t"
            "addps      %%xmm1, %%xmm0"
            :"+r" (a), "+r" (b)
            :
            :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4");

    __asm__ __volatile__(
        "movaps     %%xmm0, %0"
        : "=m" (tmp)
        : 
        :"xmm0", "memory" );             

   for(i = 0; i < 4; i++) {
      ans += tmp[i];
   }
   n -= 8;
  }
  return ans;
}

float dot_c(float *a, float *b, int n) {

  float ans = 0;
  int i;
  for(i = 0;i < n; i++) {
    ans += a[i]*b[i];
  }
  return ans;
}
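For comparison, here's a sketch of the same dot product written with SSE2 intrinsics rather than inline asm (`dot_intrin` is my name, not from the original post; assumes an x86 target and `<emmintrin.h>`). Intrinsics let the compiler handle register allocation and instruction scheduling, and this version also handles lengths that aren't a multiple of four:

```c
#include <emmintrin.h>

/* Dot product with SSE2 intrinsics; handles any n, not just multiples of 8. */
float dot_intrin(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads, like movups */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    float ans = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)                     /* scalar tail for leftover elements */
        ans += a[i] * b[i];
    return ans;
}
```

The horizontal sum at the end is the same trick as the `tmp[4]` store in the asm version, just left to the compiler to schedule.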
夏了南城 2024-08-03 15:31:20


Since this is related to the iPhone and assembly code, I'll give an example relevant to the iPhone world (rather than SSE or x86 asm).
If anybody decides to write assembly code for a real-world app, it will most likely be some sort of digital signal processing or image manipulation. Examples: converting the colorspace of RGB pixels, encoding images to JPEG/PNG, or encoding sound to MP3, AMR, or G.729 for VoIP applications.
In sound encoding there are many routines the compiler cannot translate into efficient asm code; they simply have no equivalent in C. Examples of operations commonly used in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.

Example of a saturated add: a 32-bit signed int has the range 0x80000000 <= int32 <= 0x7fffffff. If you add two ints, the result can overflow, which can be unacceptable in certain digital-signal-processing contexts. Basically, if the result overflows or underflows, a saturated add should return 0x80000000 or 0x7fffffff respectively. It takes a full C function to check for that.
An optimized version of saturated add could be:

int saturated_add(int a, int b)
{
    int result = a + b;   /* note: signed overflow here is technically UB in C */

    /* overflow is only possible when a and b have the same sign */
    if (((a ^ b) & 0x80000000) == 0)
    {
        /* if the result's sign differs from a's, the add overflowed */
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return result;
} 

You could also do multiple if/else checks for overflow, or on x86 you could check the overflow flag (which also requires asm). The iPhone uses an ARMv6 or v7 CPU with DSP instructions, so the saturated_add function above, with its multiple branches (if/else statements) and two 32-bit constants, can be replaced by one simple asm instruction that takes a single CPU cycle.
So simply making saturated_add use that asm instruction can make an entire algorithm two to three times faster (and smaller in size). Here's the QADD manual:
QADD
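For reference, the same saturation can also be written in portable C by widening to 64 bits (a sketch; `saturated_add32` is my name, not from the post). This sidesteps the signed-overflow undefined behavior of the version above, and modern ARM compilers can often recognize the pattern and emit a single QADD/SSAT-style instruction themselves:

```c
#include <stdint.h>

/* Saturating 32-bit add: widen to 64 bits so the raw sum cannot overflow,
   then clamp to the int32 range. */
int32_t saturated_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;   /* clamp positive overflow */
    if (sum < INT32_MIN) return INT32_MIN;   /* clamp negative overflow */
    return (int32_t)sum;
}
```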

Another example of code that is often executed in long loops:

res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;

It seems like nothing can be optimized here, but on an ARM CPU you can use specific DSP instructions that take fewer cycles than a plain multiplication! That's right: with the right instruction, a + b*c can execute faster than a simple a*b. For cases like this the compiler simply cannot understand the logic of your code and can't use these DSP instructions directly, which is why you need to hand-write asm to optimize the code. BUT you should only hand-write the parts that genuinely need optimizing. If you start writing simple loops by hand, you almost certainly won't beat the compiler!
There are several good papers on the web on using inline assembly to code FIR filters, AMR encoding/decoding, and so on.
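The multiply-accumulate pattern above can be written so that an ARM compiler is free to map the whole expression onto a single MLA/SMLA-class instruction (a sketch; `mac16` is my name, not from the post):

```c
#include <stdint.h>

/* 16x16 -> 32 multiply-accumulate in one expression; on ARM with DSP
   extensions this can lower to a single SMLABB-style instruction. */
int32_t mac16(int32_t acc, int16_t b, int16_t c)
{
    return acc + (int32_t)b * (int32_t)c;
}
```

The key is keeping the add and the multiply in one expression, so the compiler sees the fused pattern instead of two separate operations.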

橘和柠 2024-08-03 15:31:20


Unless you are an assembly guru the odds of beating the compiler are very low.

A fragment from the above link:

For example, the bit-oriented "XOR %EAX, %EAX" instruction was the fastest way to set a register to zero in the early generations of the x86, but most code is generated by compilers, and compilers rarely generated the XOR instruction. So the IA designers decided to move the frequently occurring compiler-generated instructions up to the front of the combinational decode logic, making the literal "MOVL $0, %EAX" instruction execute faster than the XOR instruction.

帅哥哥的热头脑 2024-08-03 15:31:20


I implemented a simple cross-correlation using a generic "straight C" implementation. THEN, when it took longer than the timeslice I had available, I resorted to explicit parallelization of the algorithm and used processor intrinsics to force specific instructions to be used in the calculations. For this particular case, the computation time was reduced from >30 ms to just over 4 ms. I had a 15 ms window to complete processing before the next data acquisition occurred.

This was a SIMD-type optimization on a VLIW processor. It only required four or so processor intrinsics, which are basically assembly-language instructions given the appearance of a function call in the source code. You could do the same with inline assembly, but the syntax and register management are a little nicer with intrinsics.

Other than that, if size matters, assembler is king. I went to school with a guy who wrote a full-screen text editor in less than 512 bytes.
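As an illustration of the kind of loop involved (not the poster's actual code), a plain-C sliding-window cross-correlation looks like this; each output point is an independent dot product, which is exactly the shape that SIMD intrinsics or VLIW scheduling accelerate:

```c
/* Naive cross-correlation: out[lag] = sum over k of x[lag + k] * h[k].
   Produces nx - nh + 1 output points; assumes nx >= nh. */
void xcorr(const float *x, int nx, const float *h, int nh, float *out)
{
    for (int lag = 0; lag <= nx - nh; lag++) {
        float acc = 0.0f;
        for (int k = 0; k < nh; k++)
            acc += x[lag + k] * h[k];   /* independent MAC chain per lag */
        out[lag] = acc;
    }
}
```

The inner loop is a pure multiply-accumulate chain and the outer iterations are independent, so both vectorization and explicit parallelization apply directly.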

煮酒 2024-08-03 15:31:20


I have a checksum algorithm which requires words to be rotated by a certain number of bits. To implement it, I've got this macro:

//rotate word n right by b bits
#define ROR16(n,b) (((n)>>(b))|(((n)<<(16-(b)))&0xFFFF))

//... and inside the inner loop: 
sum ^= ROR16(val, pos);

The Visual Studio release build expands this to (val is in ax, pos is in dx, sum is in bx):

mov         ecx,10h 
sub         ecx,edx 
mov         ebp,eax 
shl         ebp,cl 
mov         cx,dx 
sar         ax,cl 
add         esi,2 
or          bp,ax 
xor         bx,bp 

The more efficient equivalent hand-generated assembly would be:

 mov       cl,dx
 ror       ax,cl
 xor       bx,ax

I haven't figured out how to emit the ror instruction from pure C code. However...
While writing this up, I remembered compiler intrinsics. I can generate the second set of instructions with:

sum ^= _rotr16(val,pos);

So my answer is: Even if you think you can beat the pure c compiler, check the intrinsics before resorting to inline assembly.
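For what it's worth, there is also a pure-C idiom that modern compilers (GCC, Clang, MSVC) generally recognize and lower to a rotate instruction, with no intrinsic needed — this is a sketch under that assumption (`ror16` is my name), written to avoid the undefined shift-by-16 that the original macro risks when b is 0:

```c
#include <stdint.h>

/* Rotate a 16-bit word right by b bits, with no undefined shifts:
   masking both shift counts with 15 keeps them in the range 0..15. */
uint16_t ror16(uint16_t n, unsigned b)
{
    b &= 15;
    return (uint16_t)((n >> b) | (n << ((16 - b) & 15)));
}
```

Whether a given compiler version actually emits `ror` for this depends on its pattern matching, so checking the generated assembly (or just using the intrinsic, as above) is still worthwhile.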

七色彩虹 2024-08-03 15:31:20

If you want to do stuff like SIMD operations, you might be able to beat a compiler. This will require good knowledge of the architecture and the instruction set though.

荭秂 2024-08-03 15:31:20


My best win over a compiler was a simple memcpy routine... I skipped a lot of the basic setup (e.g., I didn't need much of a stack frame, so I saved a few cycles there) and did a few pretty hairy things.

That was about six years ago, with a proprietary compiler of unknown quality. I'd have to dig up the code I had and try it against GCC now; I don't know that it would come out faster, but I wouldn't rule it out.

In the end, even though my memcpy was on average about 15x faster than the one in our C library, I just kept it in my back pocket in case I needed it. It was a toy for playing with PPC assembly, and the speed boost wasn't necessary in our application.
