What's an example of a simple C function that can be implemented faster in inline assembly?

Posted 2024-07-27 15:31:20


I'm having a hard time beating my compiler using inline assembly.

What's a good, non-contrived example of a function that the compiler has a hard time making really, really fast, but that's relatively simple to implement with inline assembly?

Comments (7)

江南烟雨〆相思醉 2024-08-03 15:31:20


If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compiler's autovectorization (if it even has autovectorization!).

Here's a very basic SSE (one of x86's SIMD instruction sets) tutorial. It's for Visual C++ inline assembly.

Edit: Here's a small pair of functions if you want to try it for yourself. They compute an n-length dot product. One uses SSE2 instructions inline (GCC inline syntax); the other is very basic C.

It's very, very simple, and I'd be surprised if a good compiler couldn't vectorize the plain C loop; but if it doesn't, you should see a speedup from the SSE2 version. The SSE2 version could probably be faster if I used more registers, but I don't want to stretch my very weak SSE skills :).

float dot_asm(float *a, float *b, int n)
{
  float ans = 0;
  int i;
  // Assumes n is a multiple of 8; arrays with n % 8 != 0 are not handled.
  while (n > 0) {
    float tmp[4] __attribute__ ((aligned(16)));

     __asm__ __volatile__(
            "xorps      %%xmm0, %%xmm0\n\t"
            "movups     (%0), %%xmm1\n\t"
            "movups     16(%0), %%xmm2\n\t"
            "movups     (%1), %%xmm3\n\t"
            "movups     16(%1), %%xmm4\n\t"
            "add        $32,%0\n\t"
            "add        $32,%1\n\t"
            "mulps      %%xmm3, %%xmm1\n\t"
            "mulps      %%xmm4, %%xmm2\n\t"
            "addps      %%xmm2, %%xmm1\n\t"
            "addps      %%xmm1, %%xmm0"
            :"+r" (a), "+r" (b)
            :
            :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4");

    __asm__ __volatile__(
        "movaps     %%xmm0, %0"
        : "=m" (tmp)
        : 
        :"xmm0", "memory" );             

   for(i = 0; i < 4; i++) {
      ans += tmp[i];
   }
   n -= 8;
  }
  return ans;
}

float dot_c(float *a, float *b, int n) {

  float ans = 0;
  int i;
  for(i = 0;i < n; i++) {
    ans += a[i]*b[i];
  }
  return ans;
}
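For comparison, here's a sketch of the same dot product written with SSE2 intrinsics rather than inline asm (`dot_intrin` is my name, not from the original post; assumes an x86 target and `<emmintrin.h>`). Intrinsics let the compiler handle register allocation and instruction scheduling, and this version also handles lengths that aren't a multiple of four:

```c
#include <emmintrin.h>

/* Dot product with SSE2 intrinsics; handles any n, not just multiples of 8. */
float dot_intrin(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads, like movups */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    float ans = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)                     /* scalar tail for leftover elements */
        ans += a[i] * b[i];
    return ans;
}
```

The horizontal sum at the end is the same trick as the `tmp[4]` store in the asm version, just left to the compiler to schedule.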
夏了南城 2024-08-03 15:31:20


Since this is related to the iPhone and assembly code, I'll give an example relevant to the iPhone world (rather than SSE or x86 asm).
If anybody decides to write assembly code for a real-world app, it will most likely be some sort of digital signal processing or image manipulation. Examples: converting the colorspace of RGB pixels, encoding images to JPEG/PNG, or encoding sound to MP3, AMR, or G.729 for VoIP applications.
In sound encoding there are many routines the compiler cannot translate into efficient asm code; they simply have no equivalent in C. Examples of operations commonly used in sound processing: saturated math, multiply-accumulate routines, matrix multiplication.

Example of a saturated add: a 32-bit signed int has the range 0x80000000 <= int32 <= 0x7fffffff. If you add two ints, the result can overflow, which can be unacceptable in certain digital-signal-processing contexts. Basically, if the result overflows or underflows, a saturated add should return 0x80000000 or 0x7fffffff respectively. It takes a full C function to check for that.
An optimized version of saturated add could be:

int saturated_add(int a, int b)
{
    int result = a + b;   /* note: signed overflow here is technically UB in C */

    /* overflow is only possible when a and b have the same sign */
    if (((a ^ b) & 0x80000000) == 0)
    {
        /* if the result's sign differs from a's, the add overflowed */
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return result;
} 

You could also do multiple if/else checks for overflow, or on x86 you could check the overflow flag (which also requires asm). The iPhone uses an ARMv6 or v7 CPU with DSP instructions, so the saturated_add function above, with its multiple branches (if/else statements) and two 32-bit constants, can be replaced by one simple asm instruction that takes a single CPU cycle.
So simply making saturated_add use that asm instruction can make an entire algorithm two to three times faster (and smaller in size). Here's the QADD manual:
QADD
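For reference, the same saturation can also be written in portable C by widening to 64 bits (a sketch; `saturated_add32` is my name, not from the post). This sidesteps the signed-overflow undefined behavior of the version above, and modern ARM compilers can often recognize the pattern and emit a single QADD/SSAT-style instruction themselves:

```c
#include <stdint.h>

/* Saturating 32-bit add: widen to 64 bits so the raw sum cannot overflow,
   then clamp to the int32 range. */
int32_t saturated_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;   /* clamp positive overflow */
    if (sum < INT32_MIN) return INT32_MIN;   /* clamp negative overflow */
    return (int32_t)sum;
}
```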

Another example of code that is often executed in long loops:

res1 = a + b1*c1;
res2 = a + b2*c2;
res3 = a + b3*c3;

It seems like nothing can be optimized here, but on an ARM CPU you can use specific DSP instructions that take fewer cycles than a plain multiplication! That's right: with the right instruction, a + b*c can execute faster than a simple a*b. For cases like this the compiler simply cannot understand the logic of your code and can't use these DSP instructions directly, which is why you need to hand-write asm to optimize the code. BUT you should only hand-write the parts that genuinely need optimizing. If you start writing simple loops by hand, you almost certainly won't beat the compiler!
There are several good papers on the web on using inline assembly to code FIR filters, AMR encoding/decoding, and so on.
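The multiply-accumulate pattern above can be written so that an ARM compiler is free to map the whole expression onto a single MLA/SMLA-class instruction (a sketch; `mac16` is my name, not from the post):

```c
#include <stdint.h>

/* 16x16 -> 32 multiply-accumulate in one expression; on ARM with DSP
   extensions this can lower to a single SMLABB-style instruction. */
int32_t mac16(int32_t acc, int16_t b, int16_t c)
{
    return acc + (int32_t)b * (int32_t)c;
}
```

The key is keeping the add and the multiply in one expression, so the compiler sees the fused pattern instead of two separate operations.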

橘和柠 2024-08-03 15:31:20


Unless you are an assembly guru the odds of beating the compiler are very low.

A fragment from the above link:

For example, the bit-oriented "XOR %EAX, %EAX" instruction was the fastest way to set a register to zero in the early generations of the x86, but most code is generated by compilers, and compilers rarely generated the XOR instruction. So the IA designers decided to move the frequently occurring compiler-generated instructions up to the front of the combinational decode logic, making the literal "MOVL $0, %EAX" instruction execute faster than the XOR instruction.

帅哥哥的热头脑 2024-08-03 15:31:20


I implemented a simple cross-correlation using a generic "straight C" implementation. THEN, when it took longer than the timeslice I had available, I resorted to explicit parallelization of the algorithm and used processor intrinsics to force specific instructions to be used in the calculations. For this particular case, the computation time was reduced from >30 ms to just over 4 ms. I had a 15 ms window to complete processing before the next data acquisition occurred.

This was a SIMD-type optimization on a VLIW processor. It only required four or so processor intrinsics, which are basically assembly-language instructions given the appearance of a function call in the source code. You could do the same with inline assembly, but the syntax and register management are a little nicer with intrinsics.

Other than that, if size matters, assembler is king. I went to school with a guy who wrote a full-screen text editor in less than 512 bytes.
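As an illustration of the kind of loop involved (not the poster's actual code), a plain-C sliding-window cross-correlation looks like this; each output point is an independent dot product, which is exactly the shape that SIMD intrinsics or VLIW scheduling accelerate:

```c
/* Naive cross-correlation: out[lag] = sum over k of x[lag + k] * h[k].
   Produces nx - nh + 1 output points; assumes nx >= nh. */
void xcorr(const float *x, int nx, const float *h, int nh, float *out)
{
    for (int lag = 0; lag <= nx - nh; lag++) {
        float acc = 0.0f;
        for (int k = 0; k < nh; k++)
            acc += x[lag + k] * h[k];   /* independent MAC chain per lag */
        out[lag] = acc;
    }
}
```

The inner loop is a pure multiply-accumulate chain and the outer iterations are independent, so both vectorization and explicit parallelization apply directly.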

煮酒 2024-08-03 15:31:20


I have a checksum algorithm which requires words to be rotated by a certain number of bits. To implement it, I've got this macro:

//rotate word n right by b bits
#define ROR16(n,b) (((n)>>(b))|(((n)<<(16-(b)))&0xFFFF))

//... and inside the inner loop: 
sum ^= ROR16(val, pos);

The Visual Studio release build expands this to (val is in ax, pos is in dx, sum is in bx):

mov         ecx,10h 
sub         ecx,edx 
mov         ebp,eax 
shl         ebp,cl 
mov         cx,dx 
sar         ax,cl 
add         esi,2 
or          bp,ax 
xor         bx,bp 

The more efficient equivalent hand-generated assembly would be:

 mov       cl,dx
 ror       ax,cl
 xor       bx,ax

I haven't figured out how to emit the ror instruction from pure C code. However...
While writing this up, I remembered compiler intrinsics. I can generate the second set of instructions with:

sum ^= _rotr16(val,pos);

So my answer is: Even if you think you can beat the pure c compiler, check the intrinsics before resorting to inline assembly.
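For what it's worth, there is also a pure-C idiom that modern compilers (GCC, Clang, MSVC) generally recognize and lower to a rotate instruction, with no intrinsic needed — this is a sketch under that assumption (`ror16` is my name), written to avoid the undefined shift-by-16 that the original macro risks when b is 0:

```c
#include <stdint.h>

/* Rotate a 16-bit word right by b bits, with no undefined shifts:
   masking both shift counts with 15 keeps them in the range 0..15. */
uint16_t ror16(uint16_t n, unsigned b)
{
    b &= 15;
    return (uint16_t)((n >> b) | (n << ((16 - b) & 15)));
}
```

Whether a given compiler version actually emits `ror` for this depends on its pattern matching, so checking the generated assembly (or just using the intrinsic, as above) is still worthwhile.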

七色彩虹 2024-08-03 15:31:20

If you want to do stuff like SIMD operations, you might be able to beat a compiler. This will require good knowledge of the architecture and the instruction set though.

荭秂 2024-08-03 15:31:20


My best win over a compiler was a simple memcpy routine... I skipped a lot of the basic setup (e.g., I didn't need much of a stack frame, so I saved a few cycles there) and did a few pretty hairy things.

That was about six years ago, with a proprietary compiler of unknown quality. I'd have to dig up the code I had and try it against GCC now; I don't know that it would come out faster, but I wouldn't rule it out.

In the end, even though my memcpy was on average about 15x faster than the one in our C library, I just kept it in my back pocket in case I needed it. It was a toy for playing with PPC assembly, and the speed boost wasn't necessary in our application.
