What do linkers do?

Posted on 2024-09-10 20:56:11

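I've always wondered. I know that compilers convert the code you write into binaries, but what do linkers do? They've always been a mystery to me.

I roughly understand what "linking" is: it is when references to libraries and frameworks are added to the binary. I don't understand anything beyond that. For me it "just works". I also understand the basics of dynamic linking, but nothing too deep.

Could someone explain the terms?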

又爬满兰若 2024-09-17 20:56:11

To understand linkers, it helps to first understand what happens "under the hood" when you convert a source file (such as a C or C++ file) into an executable file (a file that can be run on your machine, or on someone else's machine with the same architecture).

Under the hood, when a program is compiled, the compiler converts the source file into object code. Object code is machine code for your particular computer architecture, but with references to anything outside the file still left open, so it cannot be run on its own yet. Traditionally, these files have an .OBJ extension (.o on Unix-like systems).

After the object file is created, the linker comes into play. More often than not, a real program that does anything useful will need to reference other files. In C, for example, a simple program to print your name to the screen would include the call:

printf("Hello Kristina!\n");

When the compiler compiles your program into an OBJ file, it simply records a reference to the printf function. The linker resolves this reference. Most programming languages have a standard library of routines to cover the basic stuff expected from that language. The linker links your OBJ file with this standard library. The linker can also link your OBJ file with other OBJ files: you can create other OBJ files containing functions that can be called from another OBJ file. The linker works almost like a word processor's copy and paste. It "copies" out all the necessary functions that your program references and creates a single executable. Sometimes the libraries that are copied in depend on yet other OBJ or library files, so a linker sometimes has to get pretty recursive to do its job.
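
The "copy what is referenced, then copy what that references" behaviour described above can be sketched as a toy model in Python. This is only an illustration under made-up assumptions: the file names, the symbol names, and the `link` helper are all hypothetical, and a real linker works on actual object file formats rather than dictionaries.

```python
# Toy model: each "object file" lists the symbols it defines and the symbols
# it references but does not define. All names are made up for illustration.
OBJECTS = {
    "main.obj": {"defines": {"main"}, "needs": {"printf", "greet"}},
    "greet.obj": {"defines": {"greet"}, "needs": {"printf"}},
}

# A "library" here is just a bag of object files that are pulled in on demand.
LIBRARY = {
    "printf.obj": {"defines": {"printf"}, "needs": {"write"}},
    "write.obj": {"defines": {"write"}, "needs": set()},
    "unused.obj": {"defines": {"qsort"}, "needs": set()},
}


def link(objects, library):
    """Return the sorted names of the files that end up in the executable."""
    linked = dict(objects)
    while True:
        defined = set().union(*(o["defines"] for o in linked.values()))
        needed = set().union(*(o["needs"] for o in linked.values()))
        undefined = needed - defined
        if not undefined:
            return sorted(linked)
        # Pull in every library member that defines a still-missing symbol.
        pulled = {
            name: obj
            for name, obj in library.items()
            if name not in linked and obj["defines"] & undefined
        }
        if not pulled:
            raise LookupError(f"undefined symbols: {sorted(undefined)}")
        linked.update(pulled)
```

Calling `link(OBJECTS, LIBRARY)` pulls in `printf.obj` because `main.obj` needs `printf`, then `write.obj` because `printf.obj` needs `write`, and leaves `unused.obj` out. A reference that nothing defines ends in the familiar "undefined symbol" error.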

Note that not all operating systems create a single executable. Windows, for example, uses DLLs, which keep shared functions together in a separate file that is loaded at run time. This reduces the size of your executable, but makes your executable dependent on these specific DLLs. DOS used to use things called overlays (.OVL files). These had many purposes, but one was to keep commonly used functions together in one file (another purpose, in case you're wondering, was to be able to fit large programs into memory: DOS had a memory limitation, and overlays could be "unloaded" from memory while other overlays were "loaded" on top of that memory, hence the name "overlays"). Linux has shared libraries, which is basically the same idea as DLLs (hard-core Linux guys I know would tell me there are MANY BIG differences).

十年九夏 2024-09-17 20:56:11

Address relocation minimal example

Address relocation is one of the crucial functions of linking.

So let's have a look at how it works with a minimal example.

0) Introduction

Summary: relocation edits the .text section of object files to translate object-file addresses into the final addresses of the executable.

This must be done by the linker because the compiler only sees one input file at a time, but we must know about all object files at once to decide how to:

- resolve undefined symbols, such as functions that are declared but not defined in the current file
- lay out the .text and .data sections of multiple object files so that they do not clash
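
The second point, placing same-named sections from several object files without overlap, can be sketched roughly as follows. This is a simplified Python model, not what ld actually does: the `lay_out` helper, the file names, the sizes, and the 16-byte alignment are all assumptions chosen for illustration.

```python
def lay_out(objects, align=16):
    """Assign each input section an offset inside its merged output section.

    `objects` maps a file name to {section name: size in bytes}.
    Returns {(file name, section name): offset within the merged section}.
    The alignment value is an arbitrary assumption for this sketch.
    """
    next_free = {}  # section name -> first unused offset so far
    placement = {}
    for file_name, sections in objects.items():
        for section, size in sections.items():
            start = next_free.get(section, 0)
            start = (start + align - 1) // align * align  # round up to alignment
            placement[(file_name, section)] = start
            next_free[section] = start + size
    return placement


# Two hypothetical object files, each with its own .text and .data.
INPUTS = {
    "a.o": {".text": 0x27, ".data": 13},
    "b.o": {".text": 10, ".data": 4},
}
```

With these inputs, `lay_out(INPUTS)` puts the sections of `a.o` at offset 0 and pushes the sections of `b.o` behind them, so any address inside `b.o` moves. That shift is exactly why addresses recorded in a single object file cannot be final.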

Prerequisites: minimal understanding of:

- x86-64 or IA-32 assembly
- the global structure of an ELF file. I have made a tutorial for that

Linking has nothing to do with C or C++ specifically: compilers just generate the object files. The linker then takes them as input without ever knowing what language compiled them. It might as well be Fortran.

So to cut out the cruft, let's study a NASM x86-64 ELF Linux hello world:

section .data
    hello_world db "Hello world!", 10
section .text
    global _start
    _start:

        ; sys_write
        mov rax, 1
        mov rdi, 1
        mov rsi, hello_world
        mov rdx, 13
        syscall

        ; sys_exit
        mov rax, 60
        mov rdi, 0
        syscall

assembled and linked with:

nasm -f elf64 -o hello_world.o hello_world.asm
ld -o hello_world.out hello_world.o

using NASM 2.10.09.

1) .text of .o

First we disassemble the .text section of the object file:

objdump -d hello_world.o

which gives:

0000000000000000 <_start>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   48 be 00 00 00 00 00    movabs $0x0,%rsi
  11:   00 00 00
  14:   ba 0d 00 00 00          mov    $0xd,%edx
  19:   0f 05                   syscall
  1b:   b8 3c 00 00 00          mov    $0x3c,%eax
  20:   bf 00 00 00 00          mov    $0x0,%edi
  25:   0f 05                   syscall

the crucial lines are:

   a:   48 be 00 00 00 00 00    movabs $0x0,%rsi
  11:   00 00 00

which should move the address of the hello world string into the rsi register, which is then passed to the write system call.

But wait! How can the compiler (here, the assembler) possibly know where "Hello world!" will end up in memory when the program is loaded?

Well, it can't, especially after we link a bunch of .o files together with multiple .data sections.

Only the linker can do that, since only it sees all those object files.

So the compiler just:

- puts a placeholder value 0x0 in the compiled output
- gives the linker some extra information on how to patch the compiled code with the correct addresses

This "extra information" is contained in the .rela.text section of the object file.

2) .rela.text

.rela.text stands for "relocation of the .text section".

The word relocation is used because the linker will have to relocate the address from the object into the executable.

We can dump the .rela.text section with:

readelf -r hello_world.o

which contains:

Relocation section '.rela.text' at offset 0x340 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000c  000200000001 R_X86_64_64       0000000000000000 .data + 0

The format of this section is fixed and documented at: http://www.sco.com/developers/gabi/2003-12-17/ch4.reloc.html

Each entry tells the linker about one address that needs to be relocated. Here we have only one, for the string.
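
To make the record layout concrete, here is a small Python sketch that decodes one such entry. It assumes the 64-bit little-endian `Elf64_Rela` layout from the ELF specification (`r_offset`, `r_info`, `r_addend`, 8 bytes each). The raw bytes are rebuilt from the values readelf printed rather than read from the file, and the `decode_rela` helper is a made-up name.

```python
import struct

# Elf64_Rela: r_offset (u64), r_info (u64), r_addend (i64), little-endian.
RELA_ENTRY = struct.Struct("<QQq")
R_X86_64_64 = 1  # relocation type number defined by the x86-64 psABI


def decode_rela(raw):
    """Decode one 24-byte Elf64_Rela record into its fields."""
    r_offset, r_info, r_addend = RELA_ENTRY.unpack(raw)
    return {
        "offset": r_offset,
        "symbol_index": r_info >> 32,   # what the ELF64_R_SYM macro extracts
        "type": r_info & 0xFFFFFFFF,    # what the ELF64_R_TYPE macro extracts
        "addend": r_addend,
    }


# Rebuild the record from the Offset and Info columns of the readelf output.
raw_entry = RELA_ENTRY.pack(0xC, 0x000200000001, 0)
entry = decode_rela(raw_entry)
```

The Info value 000200000001 splits into symbol index 2 in the upper half and type 1 in the lower half, and type 1 is what readelf prints as R_X86_64_64.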

Simplifying a bit, this particular entry carries the following information:

- Offset = 0xC: the first byte of .text that this entry changes.
  If we look back at the disassembly, it is exactly inside the critical movabs $0x0,%rsi: the instruction starts at 0xA, its first two bytes (48 be) are the prefix and opcode, and the 64-bit immediate that holds the address starts at 0xC.
- Name = .data: the address points into the .data section.
- Type = R_X86_64_64, which specifies exactly what calculation has to be done to translate the address.
  This field is processor dependent, and is thus documented in the AMD64 System V ABI extension, section 4.4 "Relocation".
  That document says that R_X86_64_64 does:
  - Field = word64: 8 bytes, thus the 00 00 00 00 00 00 00 00 at address 0xC
  - Calculation = S + A
    - S is the value of the symbol referenced by the entry, here the final address of the start of .data (still 0 in the object file, as the Sym. Value column shows)
    - A is the addend, which is 0 here. This is a field of the relocation entry.
  So S + A is simply the address of the very first byte of the .data section, which is where our string lives.
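
As a rough sketch of what the linker does with this entry, the following Python applies the S + A calculation to the object file's .text bytes, taking S to be the final address of the symbol named in the entry. The bytes are copied from the objdump listing of hello_world.o. The value 0x6000d8 for S is an assumption taken from this particular link (it shows up in the executable's disassembly), not something the sketch computes, and `apply_r_x86_64_64` is a made-up helper name.

```python
import struct

# .text bytes of hello_world.o, one string per instruction in the listing.
OBJECT_TEXT = bytes.fromhex(
    "b801000000"            # mov    $0x1,%eax
    "bf01000000"            # mov    $0x1,%edi
    "48be0000000000000000"  # movabs $0x0,%rsi  (placeholder address)
    "ba0d000000"            # mov    $0xd,%edx
    "0f05"                  # syscall
    "b83c000000"            # mov    $0x3c,%eax
    "bf00000000"            # mov    $0x0,%edi
    "0f05"                  # syscall
)


def apply_r_x86_64_64(text, offset, symbol_value, addend):
    """Return a copy of `text` whose word64 field at `offset` holds S + A."""
    patched = bytearray(text)
    struct.pack_into("<Q", patched, offset, symbol_value + addend)
    return bytes(patched)


# S = 0x6000d8 is assumed: it is where ld placed .data in this link.
linked_text = apply_r_x86_64_64(OBJECT_TEXT, 0xC, 0x6000D8, 0)
```

Only the eight bytes starting at offset 0xC are rewritten. Everything else in .text is copied through unchanged.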

3) .text of .out

Now let's look at the text area of the executable that ld generated for us:

objdump -d hello_world.out

gives:

00000000004000b0 <_start>:
  4000b0:   b8 01 00 00 00          mov    $0x1,%eax
  4000b5:   bf 01 00 00 00          mov    $0x1,%edi
  4000ba:   48 be d8 00 60 00 00    movabs $0x6000d8,%rsi
  4000c1:   00 00 00
  4000c4:   ba 0d 00 00 00          mov    $0xd,%edx
  4000c9:   0f 05                   syscall
  4000cb:   b8 3c 00 00 00          mov    $0x3c,%eax
  4000d0:   bf 00 00 00 00          mov    $0x0,%edi
  4000d5:   0f 05                   syscall

So the only thing that changed in the encoded bytes compared to the object file is the critical instruction:

  4000ba:   48 be d8 00 60 00 00    movabs $0x6000d8,%rsi
  4000c1:   00 00 00

which now point to the address 0x6000d8 (d8 00 60 00 00 00 00 00 in little-endian) instead of 0x0.
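
A quick way to double-check both claims (only that instruction changed, and the new bytes spell 0x6000d8) is to compare the two byte sequences directly. The Python below is a small sketch with the bytes typed in from the two objdump listings; `changed_offsets` is a made-up helper name.

```python
# Bytes from "objdump -d hello_world.o" and "objdump -d hello_world.out".
OBJECT_TEXT = bytes.fromhex(
    "b801000000" "bf01000000" "48be0000000000000000"
    "ba0d000000" "0f05" "b83c000000" "bf00000000" "0f05"
)
LINKED_TEXT = bytes.fromhex(
    "b801000000" "bf01000000" "48bed800600000000000"
    "ba0d000000" "0f05" "b83c000000" "bf00000000" "0f05"
)


def changed_offsets(before, after):
    """Return the offsets at which two equally long byte strings differ."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


# The 8 bytes at offset 0xC, read least significant byte first.
patched_address = int.from_bytes(LINKED_TEXT[0xC:0xC + 8], "little")
```

`changed_offsets(OBJECT_TEXT, LINKED_TEXT)` reports only offsets 0xC and 0xE, both inside the 8-byte address field (the other six bytes were already zero), and `patched_address` comes out as 0x6000d8.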

Is this the right location for the hello_world string?

To decide, we have to check the program headers, which tell Linux where to load each segment.

We dump them with:

readelf -l hello_world.out

which gives:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000000d7 0x00000000000000d7  R E    200000
  LOAD           0x00000000000000d8 0x00000000006000d8 0x00000000006000d8
                 0x000000000000000d 0x000000000000000d  RW     200000

 Section to Segment mapping:
  Segment Sections...
   00     .text
   01     .data

This tells us that the .data section, which is the second one, starts at VirtAddr = 0x6000d8.

And the only thing in the .data section is our hello world string.
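
The same check can be written down as a small lookup. The Python below is a sketch that hard-codes the two LOAD entries from the readelf output instead of parsing the file, and `locate` is a made-up helper name.

```python
# (file offset, virtual address, file size, flags) of the two LOAD entries.
SEGMENTS = [
    (0x00, 0x400000, 0xD7, "R E"),  # segment 00: .text
    (0xD8, 0x6000D8, 0x0D, "RW"),   # segment 01: .data
]


def locate(vaddr, segments):
    """Return (segment index, file offset) for a virtual address, or None."""
    for index, (offset, start, filesz, _flags) in enumerate(segments):
        if start <= vaddr < start + filesz:
            return index, offset + (vaddr - start)
    return None
```

`locate(0x6000D8, SEGMENTS)` lands in segment 01 at file offset 0xd8, and that segment's size of 0xd (13) bytes matches the length of "Hello world!" plus the newline.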

Bonus level

- PIE linking: What is the -fPIE option for position-independent executables in gcc and ld?
- _start entry point: What is global _start in assembly language?
- fixing a variable's address in the linker script: How to place a variable at a given absolute address in memory (with GCC)
- linker-script defined symbols like etext, edata and end: Where are the symbols etext, edata and end defined?
- What is the effect of extern "C" in C++?

相对绾红妆 2024-09-17 20:56:11

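In languages like C, individual modules of code are traditionally compiled separately into blobs of object code. These are ready to execute in every respect except that the references a module makes outside itself (to libraries or to other modules) have not yet been resolved: they are left blank, pending someone coming along and making all the connections.

What the linker does is look at all the modules together, look at what each module needs to connect to outside itself, and look at everything each module exports. It then fixes all of that up and produces a final executable, which can then be run.

Where dynamic linking is also going on, the output of the linker still cannot be run as it stands. Some references to external libraries are still unresolved, and the OS resolves them when it loads the application (or possibly even later, during the run).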

歌枕肩 2024-09-17 20:56:11

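When the compiler produces an object file, it includes entries for the symbols defined in that object file and references to symbols that are not defined in it. The linker takes those and puts them together so that (when everything works right) all the external references from each file are satisfied by symbols defined in other object files.

It then combines all those object files and assigns an address to each symbol. Where one object file has an external reference to another, it fills in the symbol's address at every place where it is used. In a typical case, it also builds a table of all the absolute addresses used, so that the loader can "fix up" those addresses when the file is loaded (that is, it adds the base load address to each of them so they all refer to the correct memory address).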
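
The load-time "fix up" step can be sketched like this in Python. It is a simplified model under stated assumptions: the image is a flat byte string, every absolute address is a 64-bit little-endian value, and the fix-up table is just a list of offsets. The `load_at` helper and the addresses in the usage note are made up for illustration.

```python
import struct


def load_at(image, fixup_offsets, linked_base, actual_base):
    """Rebase every 64-bit absolute address listed in the fix-up table.

    `linked_base` is the address the linker assumed; `actual_base` is where
    the loader really placed the image.
    """
    delta = actual_base - linked_base
    loaded = bytearray(image)
    for offset in fixup_offsets:
        (address,) = struct.unpack_from("<Q", loaded, offset)
        struct.pack_into("<Q", loaded, offset, address + delta)
    return bytes(loaded)
```

For example, an image linked for base 0x400000 that stores the address 0x400010 at offset 0 would, after `load_at(image, [0], 0x400000, 0x7F0000)`, hold 0x7F0010 there. Bytes that are not listed in the table are left alone.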

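Quite a few modern linkers can also carry out some (in a few cases, a lot) of other "stuff", such as optimizing the code in ways that are only possible once all of the modules are visible. For example, they can remove functions that were included because some other module might have called them, once putting all the modules together makes it apparent that nothing ever calls them.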
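
That last example, dropping functions nothing calls, boils down to a reachability walk over the whole program's call graph. The Python below is a minimal sketch of the idea. The call graph and function names are invented, and real linkers work on sections and relocations rather than a ready-made graph.

```python
def reachable_functions(calls, entry):
    """Return the set of functions reachable from `entry` in the call graph."""
    keep = set()
    pending = [entry]
    while pending:
        name = pending.pop()
        if name in keep:
            continue
        keep.add(name)
        pending.extend(calls.get(name, ()))
    return keep


# Hypothetical whole-program call graph, visible only once all modules are combined.
CALLS = {
    "main": ["parse", "report"],
    "parse": ["read_line"],
    "report": [],
    "read_line": [],
    "legacy_export": ["read_line"],
}
```

Here `reachable_functions(CALLS, "main")` keeps `main`, `parse`, `report` and `read_line`, so `legacy_export` could be dropped. No single module could have known that on its own.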
