Is it possible to create threads without system calls in Linux x86 GAS assembly?

Posted 2024-07-16 03:53:32


Whilst learning the "assembler language" (in Linux on an x86 architecture, using GNU as as the assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary, as your program runs in user-space.
However, system calls are rather expensive in terms of performance, as they require an interrupt, which means that a context switch must be made from your currently active program in user-space to the system running in kernel-space.

The point I want to make is this: I'm currently implementing a compiler (for a university project), and one of the extra features I wanted to add is support for multi-threaded code, in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be generated automatically by the compiler itself, it is almost guaranteed that there will be really tiny bits of multi-threaded code in it as well. To gain a performance win, I must be sure that using threads actually delivers one.

My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...

My question is therefore twofold (with an extra bonus question underneath it):

  • Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need for system calls?
  • Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?

My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficient as possible?

Comments (7)

琉璃繁缕 2024-07-23 03:53:33


The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.

User threads won't give you the threading support that you want, because they all run on the same hardware thread.

Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.

Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.

七色彩虹 2024-07-23 03:53:33


"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".

The short answer is you can do multithreaded programming without calling expensive OS task-management primitives. Simply ignore the OS for thread-scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.) We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.

Our PARLANSE programming language is a parallel programming language:
See http://www.semdesigns.com/Products/Parlanse/index.html

PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler, which takes into account the context of the grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., a "wait") might be required, so the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to a compiler-preallocated stack and jumps to the grain code. Completion of the grain restarts the forker.

Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can logically fork millions of grains; the scheduler stops parent grains from generating more work once the work queues are long enough that more work won't be helpful. The scheduler implements work-stealing to allow work-starved CPUs to grab ready grains from neighboring CPUs' work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!

PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.

末蓝 2024-07-23 03:53:33


Implement user-mode threading.

Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.

You are free to maintain an arbitrary number of user-mode threads within a single kernel thread. It's just that it's your responsibility to switch between them often enough that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.

热鲨 2024-07-23 03:53:33


If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O-bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running on will still be running at 100% either way.

System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.

Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code; you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were repeatedly accessing the same data, due to cache-coherency overhead.

倦话 2024-07-23 03:53:33


Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.

Obligatory BLUF:

Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.

Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread-creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is roughly the number of cores/hyper-threads on the system) and re-task them, then the answer COULD be yes, depending on your implementation.

Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.

Onto my 2-cents worth of content.

I recently created what effectively operates as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between them would be smart, and those would require additional syscalls).

This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.

To implement this system (entirely in userspace and also via non-root access if desired) the following were required:

A notion of what threads boil down to:
A stack for stack operations (kinda self explaining and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents

What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).

A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.

So, it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.

I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.

隔岸观火 2024-07-23 03:53:33


System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.

感情废物 2024-07-23 03:53:33


First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.

Here are some pointers:

  • Posix threads: link text
  • A tutorial where you will learn how to call C functions from assembly: link text
  • Butenhof's book on POSIX threads link text