I am almost certain this question has been asked before, but I cannot seem to find the right keywords to search for to get an answer. My apologies if this is a duplicate.
I am trying to better understand the compilation process of, say, a C++ file as it goes from C++ syntax to binary machine code. In addition, I am trying to understand what influences the resulting machine code.
First, I am nearly certain that the following are the only factors (for most systems) that dictate the final machine code (please correct me if I am wrong here):
- The tools used to compile, assemble, and link.
- Things like the GNU C compiler, clang, Visual Studio, NASM, etc.
- The kernel of the system being used.
- Whether it's a specific version of the Linux kernel, the Windows kernel, or some other kernel like the macOS one.
- The operating system being used.
- This one I am less clear about. I am unsure whether machines running the same Linux kernel but different operating systems, let's say Debian vs. CentOS, will produce different binaries.
- Lastly, the hardware architecture.
- Different CPU architectures like ARM64, x86, and PowerPC take different opcodes, so obviously the machine code should be different.
So with that being said here is my understanding of the compilation process and where each of these dependencies show up.
- I write a C++ file and use code that my system can understand. A good example might be using <winsock.h> on Windows and <sys/socket.h> on Linux.
- The preprocessor runs and executes any preprocessor macros.
- Here I know that different preprocessors will define different macros but for now I will assume this is not too machine dependent. (This might be wrong to assume).
- The compiler tools run to produce assembly file outputs.
- Here the assembly produced depends on the compiler and what optimizations or choices it makes.
- It also depends on the kernel because different kernels have different system calls and store files in different locations. This means the assembly might make changes such as different branching when calling functions specific to that kernel.
- The operating system? Still unsure how the operating system fits in to this. If two machines have the same kernel, what does the operating system do to the binaries?
- Finally the assembly code depends on the cpu architecture. I think that is a pretty obvious statement.
- Once the compiler produces assembly, we can then invoke the assembler to turn our assembly code into almost complete machine code. (I think machine code is identical to the binary opcodes a CPU manual lists, but this might be wrong.)
- The corresponding machine code files (often called object files I think) contain nearly all the instructions needed to run or reference other machine code files which will be linked in the next step.
- This machine code usually has some format (I think ELF is a popular format for linux) and this format is dependent on the linker for sure.
- I don't think the kernel, operating system, or hardware affect the layout/format of the object file but this is probably wrong. If they do please correct this.
- The hardware will affect the actual machine code produced because again I think it is a 1 to 1 mapping of machine code instructions to opcodes for a cpu.
- I am unsure if the kernel or operating system affect the linking process because I thought their changes were already incorporated in the compiling step.
- Finally the linking step occurs.
- I think this is as simple as the linker looking for all the referenced machine code and injecting it into one complete machine code file which can be executed.
- I have no clue what affects this besides the linker tool itself.
So with all that, I need help identifying inaccuracies with the procedure I described above, and any dependencies I might have missed whether it be cpu, os, kernel, or tool ones.
Thank you and sorry for the long winded question. This probably should have been broken up into multiple questions but I am too far in. If this does not go well I may ask each part in individual questions.
EDIT:
Questions with more focus.
What components of a machine affect the machine code produced given a C++ file input?
Actually that is a lot of questions, and usually your question would be much too broad for SO (as you managed to recognize yourself). But on the other hand you showed a deep interest (just by writing such a long and profound question) and also a lot of correct understanding of the process of compiling a program. The things you are missing or not understanding correctly (and are probably most interested in) are the things that I myself found hard to learn. Thus I will provide you with some important points that I think you are missing in the big picture.
Note that I am very much used to Linux, so I will mostly describe how things work on Linux. But I believe that most things also happen in a similar way on other operating systems.
Let's begin with the hardware. A modern computer has a CPU of some architecture, and there are lots of different CPU architectures. You mentioned some of them, like ARM and x86, which are families of similar CPUs and can be divided into smaller groups by bit width and/or supported extensions. Ultimately your processor has a specified instruction set that defines which opcodes it supports and what those opcodes do. When a native (compiled) program runs, there are raw opcodes in memory and the CPU directly executes them, following its architecture specification.
Aside from the CPU there is a lot more hardware connected to your computer. Usually, communicating with this hardware is complicated and not standardized. If a user program, for example, gets input keystrokes from the keyboard, it does not have to communicate with the keyboard directly, but rather does this via the operating system kernel. This works by a mechanism called a syscall interrupt: the kernel installs a handler routine that is called when a user program triggers such an interrupt with a special CPU instruction. You can think of it like a language-agnostic function call from the program into the kernel. For Linux, for example, you can find a list of all syscalls on the syscall(2) man page. The syscalls form the kernel's Application Binary Interface (kernel ABI). Reading and writing from a terminal or using a filesystem are examples of syscall functionality.
As you can see, there are already quite high-level functions implemented in the kernel. However, the functionality is still quite limited for most typical applications. To encapsulate the syscalls and provide functions for memory management, utility functions, mathematical functions and many other things you probably use in your daily programs, there is usually another layer between the program and the kernel. This thing is called the C standard library, and it is a shared library (we will cover what exactly this is in a moment). On GNU/Linux it is glibc, which is the single most important library on a GNU/Linux system (and notably not part of the kernel 1). While it implements all the features that are required by the C standard (for example functions like malloc() or strcpy()), it also ships a lot of additional functions which are a superset of the ISO C standard library, the POSIX standard and some extensions. This interface is usually called the Application Programming Interface (API) of the operating system. While it is in principle possible to bypass the API and directly use the syscalls, almost all programs (even when written in languages other than C or C++) use the C library.
Now get yourself a coffee and a few minutes of rest. We now have enough background information to look at how a C++ program is transformed into a binary, and how exactly this binary is executed.
A C++ program consists of different compilation units (usually each source file is its own compilation unit). Each compilation unit undergoes the following steps:
- Note: symbols (especially functions) that are not defined are left undefined. If you, say, call the malloc() function, this will not be compiled, but left unevaluated until later. Thus this step is also not much dependent on the operating system.
- The malloc() call is still left unevaluated and is stored in the object file's symbol table. Because most of the syscalls are wrapped in library functions, the assembly code will usually not directly contain syscall code. Thus this step is dependent on the CPU architecture. It is, however, also dependent on the ABI 2, which in turn depends on the compiler and the OS.
- Shared libraries are checked for the needed symbols, but not linked yet. For example, in the case of the malloc() call, the linker checks that there is a malloc symbol in glibc, but the symbol in the executable still remains unresolved.
- At this point you have an executable binary. As you might have noticed, there might still be unresolved symbols in that binary. Thus you cannot just load that binary into RAM and let the CPU execute it. A final step called dynamic linking is needed. On Linux, the program that performs this step is called the dynamic linker/loader. Its task is to load the executable ELF file into memory, look up all the needed dynamic libraries, load them into memory as well (a list is stored in the ELF file) and resolve the remaining symbols. This last step happens each time the program is executed. Now the malloc() symbol is finally resolved with the address in the glibc shared library.
- You have pure CPU instructions in memory, the CPU's program counter register (the one that tracks the next instruction) is set to the entry point, and the program can begin to run. Every now and then it is interrupted, either because it makes a syscall, or because the kernel scheduler interrupts it to let another program run on that CPU core.
I hope I could answer some of your questions and satisfy your curiosity. I think the most important part you were missing was how dynamic linking happens. This is a very interesting topic, related to concepts like position-independent code. I wish you good luck learning.
1 This is also one reason why some people insist on calling Linux-based systems GNU/Linux. The glibc library (together with many other GNU programs) defines much of the operating system structure, interacts with supplementary programs and configuration files, etc. There are, however, Linux-based systems without glibc. One of them is Android, which uses Google's Bionic libc.
2 The ABI is related to the calling convention. This is a mixture of operating system, programming language and compiler specification. It is one of the reasons (besides name mangling, see the comment by PeterCordes below) you need those extern "C" {...} scopes in C++ header files that declare C functions in shared libraries. It basically is a convention on how to pass parameters and return values between functions.
Neither the operating system nor the kernel is directly involved in any of this.
Their limited involvement is that if you want to build Linux 64-bit binaries for x86 using GNU tools, then you need to somehow (download and install, or build yourself) obtain the GNU tools themselves built for that target processor and that operating system, since system calls are specific to the operating system and target, as are the binaries supported by that operating system. It is not strictly just the ELF file format, which is just a container, but the linking and possibly the bootstrap that are also specific to the operating system's loader. (Or, if building something for the kernel, that would have other rules.) For example, does the application loader initialize .bss and .data for you from specific information in the ELF file, or, like on an MCU, does the bootstrap code itself have to do this?
The build of the GNU tools for a target like Linux, and ideally a pre-built binary for your OS and target, would have paths set up in some way. The C library would have a default linker script and its intimate partner, the bootstrap.
After that point, it is just a dumb toolchain. Include files, be they at the system level, compiler level, or programmer level, are just includes in the C language. The paths have defaults, and gcc knows where it was executed from, so it knows where the gcc and other libraries live in a normal build.
gcc itself is not actually a compiler; it calls other programs, such as the preprocessor, the compiler itself, the assembler and the linker.
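You can ask the driver to print the subprograms it would invoke, without actually running them. This is a sketch assuming gcc; the exact helper names (cc1, as, collect2/ld) vary by installation:

```shell
cat > t.c <<'EOF'
int main(void) { return 0; }
EOF
gcc -### t.c -o t 2>&1 | tail -n 6   # the cc1 / as / collect2 command lines
```

Running with `-v` instead of `-###` executes those same subprograms verbosely, which is a handy way to confirm that gcc is only a wrapper.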
The preprocessor is going to do the search and replace for includes and defines and end up with one great big cpp file, then pass that to the compiler.
The compiler front end (the C++ language for gcc, for example) turns that into an internal language: allocate an int with this name, add these two, and so on. Pseudo code, if you will. A lot of the optimization work is done on this, then eventually the back end (which for GNU could be x86, MIPS, ARM, etc., independent to some extent of the front and middle) takes over. The LLVM tools are at least capable of exposing that middle, internal language in external files (external to the memory used by the compiler to do the compilation); you can combine and optimize those bytecode files and then convert them to assembly, or go direct to object, in the LLVM world. I think this is an exception, not the rule; others just use internal tables.
While I think it is wise and sane to use an assembly language step, not all compilers do, so do not assume that all compilers do. Some output objects directly.
Yes, that assembly is naturally partial; external functions (labels) and variables (labels) cannot be resolved at the object level. The linker has to do that.
So the target (x86, ARM, etc.) does affect the construction of the ELF file, as there are certain items and magic numbers specific to the target. As mentioned, the operating system and/or kernel do affect the ELF in that there are rules for the construction of the binary for that kernel or operating system. Remember that ELF is just a container, like tar or zip or mkv, etc. Do not assume that the operating system can handle every possible choice you want to make with the contents that the linker will allow (the tools are dumb; they do what they are told).
So:
- Your source.
- All the relevant sources that go with it, including system includes, compiler includes and your includes.
- gcc/g++ is a wrapper program that manages the steps.
- It calls the preprocessor, which expands includes and defines into one file (no magic here).
- It calls the compiler to parse that one file into internal tables; think pseudo code and data.
- Many, many possible optimizers operate on these structures.
- The back end, including a peephole optimizer, turns the tables into assembly language (for GNU at least).
- The assembler is called to turn the asm into an object.
- If all the objects are specified and gcc is told to link, then...
- The linker combines all the objects into the binary, including the bootstrap, including already-built libraries, stubs, etc., driven by the command line or, more likely, a linker script (the linker script and bootstrap have an intimate relationship; they are not assumed to be separable, and they are not part of the compiler, they are part of a C library, etc.).
- The kernel module loader or operating system application loader is fed the file and, per the rules of that loader, loads and runs the program.