Using Assembly Language in C/C++



I remember reading somewhere that to really optimize & speed up certain section of the code, programmers write that section in Assembly language. My questions are -

  1. Is this practice still done? and How does one do this?
  2. Isn't writing in Assembly Language a bit too cumbersome & archaic?
  3. When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?

I am trying to understand this concept & any help or links are much appreciated.

UPDATE: Rephrasing point 3 as requested by dbemerlin - Because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembly expert your code will probably run slower, because the compiler often optimizes the code better than most humans can.


Comments (14)

黑色毁心梦 2024-10-09 21:04:41


The only time it's useful to revert to assembly language is when

  • the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)

    OR

  • for some inexplicable reason - the optimiser is failing to use the best CPU instructions

...AND...

  • the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.

Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:

  • the compiler knows how to do this equally well
    • to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code - see the small example after this list
  • you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
    • compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everything through specific registers would serialise it
      • in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this
  • if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whomever's maintaining the software then is)
  • the compiler will select an optimal approach for the target architecture it's told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
  • assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
  • other programmers may not know or be comfortable with assembly
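
As a small illustration of that check (the file and function names here are made up; the exact output varies by compiler and version):

/* sum.c - a trivial function to compare against any hand-written version */
long sum(const long *v, long n)
{
    long total = 0;
    for (long i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Inspect what the optimiser actually emits before assuming you can beat it:
 *   gcc -O2 -S sum.c                  -> writes the generated assembly to sum.s
 *   gcc -O2 -c sum.c && objdump -d sum.o   -> or disassemble the object file
 */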

One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.

不语却知心 2024-10-09 21:04:41


First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.

心奴独伤 2024-10-09 21:04:41


(1) Yes, the easiest way to try this out is to use inline assembly. This is compiler dependent but usually looks something like this (MSVC-style syntax; a GCC-style sketch follows after point 3):

__asm
{
    mov eax, ebx
}

(2) This is highly subjective

(3) Because you might be able to write more effective assembly code than the compiler generates.
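
The block above is MSVC-style syntax. For GCC and Clang the equivalent uses extended asm with operand constraints (and AT&T operand order); a minimal sketch, assuming an x86 target:

#include <stdio.h>

int main(void)
{
    int src = 42;
    int dst;

    /* "movl %1, %0": AT&T order, source first, destination second.
       The "r" constraints let the compiler pick which registers to use. */
    __asm__("movl %1, %0" : "=r"(dst) : "r"(src));

    printf("%d\n", dst);   /* prints 42 */
    return 0;
}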

昵称有卵用 2024-10-09 21:04:41


You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.

Summarily in the first book he explained how to use assembly programming pushed to the limits. In the followup he explained that programmers should rather use some higher level language like C and only try to optimize very specific spots using assembly, if necessary at all.

One motivation for this change of mind was that he saw that highly optimized programs for one generation of processor could become (somewhat) slow on the next generation of the same processor family, compared to code compiled from a high-level language (for example because the compiler uses new instructions, or because the performance and behavior of existing instructions change from one processor generation to another).

Another reason is that compilers are quite good and optimize aggressively nowadays; there is usually much more performance to gain from working on algorithms than from converting C code to assembly. Even for GPU (graphics card processor) programming you can do it in C using CUDA or OpenCL.

There are still some (rare) cases when you should/have to use assembly, usually to get very fine control on the hardware. But even in OS kernel code it's usually very small parts and not that much code.

2024-10-09 21:04:41


There are very few reasons to use assembly language these days; even low-level constructs like SSE and the older MMX have built-in intrinsics in both gcc and MSVC (icc too, I bet, but I never used it).

Honestly, optimizers these days are so insanely aggressive that most people couldn't match even half their performance writing code in assembly. You can change how data is ordered in memory (for locality) or tell the compiler more about your code (through #pragma), but as for actually writing assembly code... I doubt you'll get anything extra from it.

@VJo, note that using intrinsics in high level C code would let you do the same optimizations, without using a single assembly instruction.
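
As a rough sketch of that intrinsics route (assuming an x86 target with SSE; <xmmintrin.h> is available in gcc, clang and MSVC):

#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    /* Add four pairs of floats at once without writing any assembly;
       the compiler handles register allocation and scheduling. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 c = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  /* 6 8 10 12 */
    return 0;
}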

And for what it's worth, there have been discussions about the next Microsoft C++ compiler, and how they'll drop inline assembly from it. That speaks volumes about the need for it.

风和你 2024-10-09 21:04:41


I don't think you specified the processor. Different answers depending on the processor and the environment. The general answer is yes, it is still done; it is certainly not archaic. The general reason is the compilers: sometimes they do a good job at optimizing in general but not really well for specific targets. Some are really good at one target and not so good at others. Most of the time it is good enough, and most of the time you want portable C code and not non-portable assembler. But you still find that C libraries will hand-optimize memcpy and other routines where the compiler simply cannot figure out that there is a very fast way to implement them. In part that is because such corner cases are not worth the compiler writers' time to optimize for; it is easier to just solve them in assembler, and the build system grows a lot of "if this target use C, if that target use this asm file, if that other target use that asm file". So it still occurs, and I argue it must continue forever in some areas.

X86 is its own beast with a lot of history; we are at a point where you really cannot, in a practical manner, write one blob of assembler that is always faster. You can definitely optimize routines for a specific processor on a specific machine on a specific day and outperform the compiler, but other than for some specific cases it is generally futile. Educational, but overall not worth the time. Also note the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough; find the performance elsewhere.

Other platforms often means embedded: arm, mips, avr, msp430, pic, etc. You may or may not be running an operating system, and you may or may not have a cache or the other such things your desktop has. So the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them. Even C, considered perhaps to be a low-level language, doesn't match the instruction set. There will always be times when you can produce segments of assembler that outperform the compiler. Not necessarily the segment that is your bottleneck, but across the entire program you can often make improvements here and there. You still have to check the value of doing that. In an embedded environment it can and does make the difference between success and failure of a product. If your product has $25 per unit invested in a more power-hungry, higher-speed processor and more board real estate so you don't have to use assembler, but your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, less power, cheaper parts, etc., then as long as the NRE is recovered the mixed-with-asm solution wins in the long run.

True embedded is a specialized market with specialized engineers. The other embedded market - your embedded Linux Roku, TiVo, embedded phones, etc. - all need portable operating systems to survive, because you need third-party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library or the operating system, as mentioned, there may be some assembler optimizations, but as with the desktop you want to throw more hardware at it so the software can be portable instead of hand-optimized. And your product line or embedded operating system will fail if assembler is required for third-party success.

The biggest concern I have is that this knowledge is being lost at an alarming rate. Because nobody inspects the assembler, because nobody writes in assembler, etc., nobody is noticing when the compilers are not improving with respect to the code being produced. Developers often think they have to buy more hardware instead of realizing that, by knowing either the compiler or how to program better, they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code - 5-10% usually with the same source code and compiler. gcc 4 does not always produce better code than gcc 3; I keep both around because sometimes gcc 3 does better. Target-specific compilers can (though they do not always) run circles around gcc; you can sometimes see a few hundred percent improvement with the same source code and a different compiler. Where does all of this come from? The folks who still bother to look at and/or use assembler. Some of those folks work on the compiler backends. The front end and middle are certainly fun and educational, but the backend is where you make or break the quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -S myprog.c), it will make you a better high-level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler then by definition we have given up on writing and maintaining compilers for high-level languages, and software in general will cease to exist.

Understand that with gcc, for example, the output of the compiler is assembly, which is passed to an assembler that turns it into object code. The C compiler does not normally produce binaries directly. The objects are combined into the final binary by the linker, yet another program that is called by the compiler driver and is not part of the compiler itself. The compiler turns C or C++ or Ada or whatever into assembler, and then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception, not the rule. LLVM has its own runtime solution, and quite visibly shows the high-level to internal code to target code to binary path if you use it as a cross compiler.

So back to the point: yes, it is done, more often than you think. Mostly it has to do with the language not mapping directly to the instruction set, and with the compiler not always producing fast enough code. If you can get, say, dozens of times improvement on heavily used functions like malloc or memcpy, or you want an HD video player on your phone without hardware support, balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit; sometimes it is all C, but sometimes the software is completely coded in assembler. For desktop x86, the processor is not the bottleneck. The processors are microcoded. Even if you make beautiful-looking assembler on the surface, it won't run really fast across all families of x86 processors; sloppy, good-enough code is more likely to run about the same across the board.

I highly recommend learning assembler for non-x86 ISAs like arm, thumb/thumb2, mips, msp430, avr. Targets that have compilers, particularly ones with gcc or llvm compiler support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high level code much better without assembler, faster and more reliable.

山田美奈子 2024-10-09 21:04:41


It depends. It is (still) being done in some situations, but for the most part, it is not worth it. Modern CPUs are insanely complex, and it is equally complex to write efficient assembly code for them. So most of the time, the assembly you write by hand will end up slower than what the compiler can generate for you.

Assuming a decent compiler released within the last couple of years, you can usually tweak your C/C++ code to gain the same performance benefit as you would using assembly.

A lot of people in the comments and answers here are talking about the "N times speedup" they gained rewriting something in assembly, but that by itself doesn't mean too much. I got a 13 times speedup from rewriting a C function evaluating fluid dynamics equations in C, by applying many of the same optimizations as you would if you were to write it in assembly, by knowing the hardware, and by profiling. At the end, it got close enough to the theoretical peak performance of the CPU that there would be no point in rewriting it in assembly. Usually, it's not the language that's the limiting factor, but the actual code you've written. As long as you're not using "special" instructions that the compiler has difficulty with, it's hard to beat well-written C++ code.

Assembly isn't magically faster. It just takes the compiler out of the loop. That is often a bad thing, unless you really know what you're doing, since the compiler performs a lot of optimizations that are really really painful to do manually. But in rare cases, the compiler just doesn't understand your code, and can't generate efficient assembly for it, and then, it might be useful to write some assembly yourself. Other than driver development or the like (where you need to manipulate the hardware directly), the only place I can think of where writing assembly may be worth it is if you're stuck with a compiler that can't generate efficient SSE code from intrinsics (such as MSVC). Even there, I'd still start out using intrinsics in C++, and profile it and try to tweak it as much as possible, but because the compiler just isn't very good at this, it might eventually be worth it to rewrite that code in assembly.

染墨丶若流云 2024-10-09 21:04:41


Take a look here, where the guy improved performance 6 times using assembly code. So the answer is: it is still being done, but the compiler is doing a pretty good job.

享受孤独 2024-10-09 21:04:41
  1. "Is this practice still done?"
    --> It is done in image processing, signal processing, AI (eg. efficient matrix multiplication), and others. I would bet the processing of the scroll gesture on my MacBook trackpad is also partially assembly code, because it is immediate.
    --> It is even done in C# applications (see https://blogs.msdn.microsoft.com/winsdk/2015/02/09/c-and-fastcall-how-to-make-them-work-together-without-ccli-shellcode/)

  2. "Isn't writing in Assembly Language a bit too cumbersome & archaic?"
    --> It is a tool like a hammer or a screwdriver and some tasks require a watchmaker screwdriver.

  3. "When we compile C code (with or without -O3 flag), the compiler does some code optimization ... So how does inducing 'Assembly Language' help?"
    --> I like what @jalf said, that writing C code the way you would write assembly will already lead to efficient code. However, to do this you must think about how you would write the code in assembly language, so e.g. understand all places where data is copied (and feel pain each time it is unnecessary).
    With assembly language you can be sure which instructions are generated. Even if your C code is efficient, there is no guarantee that the resulting assembly will be efficient with every compiler. (see https://lucasmeijer.com/posts/cpp_unity/)
    --> With assembly language, when you distribute a binary, you can test for the CPU and take different branches depending on CPU features, optimized for AVX or just for SSE, while only needing to distribute one binary (a minimal dispatch sketch follows below). With intrinsics this is also possible in C++ or .NET Core 3. (see https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/)
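
A minimal sketch of that one-binary dispatch idea, assuming GCC or Clang (__builtin_cpu_supports is their builtin; the sum_* functions here are plain-C stand-ins for real AVX/SSE implementations):

#include <stddef.h>

/* Plain-C stand-ins; in a real build these would be the AVX- and SSE-optimized
   versions (intrinsics or assembly). */
static void sum_avx(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

static void sum_sse(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void sum_dispatch(const float *a, const float *b, float *out, size_t n)
{
    /* Decide at run time which code path the current CPU can execute. */
    if (__builtin_cpu_supports("avx"))
        sum_avx(a, b, out, n);
    else
        sum_sse(a, b, out, n);
}
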
遇到 2024-10-09 21:04:41


In my work, I used assembly on an embedded target (microcontroller) for low-level access.

But for PC software, I don't think it is very useful.

断念 2024-10-09 21:04:41


I have an example of assembly optimization I've done, but again it's on an embedded target. You can see some examples of assembly programming for PCs too, and it creates really small and fast programs, but usually not worth the effort (Look for "assembly for windows", you can find some very small and pretty programs).

My example was when I was writing a printer controller, and there was a function that was supposed to be called every 50 microseconds. It had to do reshuffling of bits, more or less. Using C I was able to do it in about 35 microseconds, and with assembly I did it in about 8 microseconds. It's a very specific procedure, but still, something real and necessary.

口干舌燥 2024-10-09 21:04:41


On some embedded devices (phones and PDAs), it's useful because the compilers are not terribly mature, and can generate extremely slow and even incorrect code. I have personally had to work around, or write assembly code to fix, the buggy output of several different compilers for ARM-based embedded platforms.

时光匆匆的小流年 2024-10-09 21:04:41
  1. Yes. Use either inline assembly or link assembly object modules. Which method you should use depends on how much assembly code you need to write. Usually it's OK to use inline assembly for a couple of lines, and to switch to a separate object module once it's more than one function.
  2. Definitely, but sometimes it's necessary. The prominent example here would be programming an operating system.
  3. Most compilers today optimize code written in a high-level language much better than anyone could ever write assembly code. People mostly use assembly to write code that would otherwise be impossible to write in a high-level language like C. If someone uses it for anything else, it means he is either better at optimization than a modern compiler (I doubt that) or just plain stupid, e.g. he doesn't know what compiler flags or function attributes to use (see the attribute sketch below this list).
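
As one hypothetical example of the "compiler flags or function attributes" point (a GCC/Clang-specific attribute; the function itself is made up):

#include <stddef.h>

/* Compile just this function with AVX2 enabled so the optimizer can
   auto-vectorize the loop, without hand-writing any assembly and without
   changing the flags for the rest of the program. It must only be called
   on CPUs that actually support AVX2. */
__attribute__((target("avx2")))
void scale(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        v[i] *= k;
}
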
贪了杯 2024-10-09 21:04:41


use this:

__asm__ __volatile__(/*assembly code goes here*/);

The __asm__ keyword can also be written as just asm.

The __volatile__ qualifier stops the compiler from optimizing the asm statement away or moving it relative to the surrounding code; it does not disable optimization in general.
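
A concrete (x86-only, GCC/Clang) example of that form, reading the time-stamp counter:

#include <stdint.h>

/* rdtsc leaves the low 32 bits in EAX and the high 32 bits in EDX, hence
   the "=a" and "=d" output constraints. __volatile__ keeps the compiler
   from deleting or reordering the statement. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}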
