如何创建并行堆栈并在其上运行协程？

发布于 2024-09-06 04:47:21 字数 2268 浏览 9 评论 0原文

我决定我应该尝试实现协程（我认为我应该这样称呼它们）以获得乐趣和利润。我希望必须使用汇编程序，如果我想让它对任何事情都有用的话，可能还需要一些 C 语言。

请记住，这是出于教育目的。使用已经构建的协程库太容易了（而且真的没什么乐趣）。

你们知道setjmp和longjmp吗？它们允许您将堆栈展开到预定义的位置，并从那里恢复执行。但是，它无法回退到堆栈上的“稍后”。只能早点回来。

jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0

我想要的是一种无需线程即可在不同堆栈上运行两个函数的方法。（显然，一次只能运行一个。我说过，没有线程。）这两个函数必须能够恢复另一个函数的执行（并停止自己的执行）。有点像他们longjmp到另一个。一旦它返回到另一个函数，它必须从离开的地方恢复（即，在将控制权交给另一个函数的调用期间或之后），有点像 longjmp 返回到 setjmp。

我是这样想的：

函数A创建并行堆栈并将其归零（分配内存等）。
函数A将其所有寄存器压入当前堆栈。
函数A将堆栈指针和基指针设置到该新位置，并推送一个神秘的数据结构，指示跳回何处以及将指令指针设置回何处。
函数A 将其大部分寄存器清零，并将指令指针设置为函数B 的开头。

那是为了初始化。现在，以下情况将无限循环：

函数 B 在该堆栈上工作，执行它需要执行的任何操作。
函数B 到达需要中断并再次给予A 控制权的程度。
函数B将其所有寄存器压入堆栈，获取一开始就给它的神秘数据结构A，并设置堆栈指针以及指向 A 告诉它的位置的指令指针。在此过程中，它交回A一个新的、修改过的数据结构，该结构告诉在哪里恢复B。
函数A醒来，弹出它推入堆栈的所有寄存器，并继续工作，直到它需要中断并再次给予B控制权。

这一切对我来说听起来不错。然而，有很多事情我不太放心。

显然，在好的 x86 上，有一条 pusha 指令可以将所有寄存器发送到堆栈。然而，处理器架构不断发展，现在使用 x86_64，我们有了更多通用寄存器，可能还有几个 SSE 寄存器。我找不到任何证据表明 pusha 确实推动了它们。现代 x86 CPU 中有大约 40 个公共寄存器。我必须自己完成所有的push操作吗？此外，SSE 寄存器没有push（尽管肯定有一个等价的东西——我对整个“x86 汇编器”这件事很陌生）。
改变指令指针有那么简单吗？我可以执行mov rip, rax（英特尔语法）吗？此外，从中获取价值必须有些特殊，因为它不断变化。如果我确实喜欢 mov rax, rip （又是英特尔语法），rip 是否会定位在 mov 指令上，直到它后面的指令，或者介于两者之间？这只是jmp foo。假的。
我曾多次提到神秘的数据结构。到目前为止，我假设它至少需要包含三件事：基指针、堆栈指针和指令指针。还有什么吗？
我是不是忘记了什么？
虽然我真的很想了解事情是如何工作的，但我很确定有一些库可以做到这一点。你知道吗？是否有任何 POSIX 或 BSD 定义的标准方法可以做到这一点，例如线程的 pthread ？

感谢您阅读我的问题文本墙。

原文

I decided I should try to implement coroutines (I think that's how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.

Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).

You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can't rewind to "later" on the stack. Only come back earlier.

jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0

What I'd like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other's execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.

This is how I thought it:

Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.

That's for the initialization. Now, the following situation will indefinitely loop:

Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.

All this sounds good to me. However, there is a number of things I'm not exactly at ease with.

Apparently, on good ol' x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we've got a lot more general-purpose registers, and likely several SSE registers. I couldn't find any evidence that pusha does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there's bound to be an equivalent—I'm new to this whole "x86 assembler" thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It's just jmp foo. Dummy.
I've mentioned a mysterious data structure a few times. Up to now I've assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I'd really like to understand how things work, I'm pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?

Thanks for reading my ~~question~~ textwall.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

最后的乘客 2024-09-13 04:47:22

您是正确的，PUSHA 无法在 x64 上工作，它会引发异常 #UD，因为 PUSHA only 推送16 位或 32 位通用寄存器。请参阅英特尔手册，了解您想了解的所有信息。

设置RIP很简单，jmp rax会将RIP设置为RAX。要检索 RIP，如果您已经知道所有协程退出来源，则可以在编译时获取它，或者可以在运行时获取它，您可以在该调用之后调用下一个地址。像这样：

a:
call b
b:
pop rax

RAX 现在将是 b。这是有效的，因为 CALL 压入下一条指令的地址。这项技术也适用于 IA32（尽管我认为在 x64 上有更好的方法，因为它支持 RIP 相对寻址，但我不知道有哪一种）。当然，如果你创建一个函数coroutine_yield，它只能拦截调用者地址:)

由于你无法在一条指令中将所有寄存器推送到堆栈，所以我不建议存储协程堆栈上的状态，因为无论如何这都会使事情变得复杂。我认为最好的做法是为每个协程实例分配一个数据结构。

为什么要将函数 A 中的内容归零？那可能没有必要。

以下是我处理整个事情的方法，试图使其尽可能简单：

创建一个包含以下内容的结构 coroutine_state：

initarg
arg
registers（也包含标志）
caller_registers

创建一个函数：

coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);

其中 coro_func 是指向协程函数体的指针。

该函数执行以下操作：

分配一个 coroutine_state 结构 cs
将 initarg 分配给 cs.initarg，这些将是协程的初始参数
将 coro_func 分配给 cs.registers.rip
将当前标志复制到 cs.registers （不是寄存器，只有标志，因为我们需要一些理智的标志来防止灾难）
为协程堆栈分配一些大小合适的区域并将其分配给 cs.registers.rsp
返回指向分配的 coroutine_state 结构的指针

现在我们还有另一个函数：

void* coroutine_next(coroutine_state cs, void* arg)，

其中cs是从coroutine_init返回的结构，它表示一个协程实例和 arg 将在协程恢复执行时被输入到协程中。

该函数由协程调用者调用，将一些新参数传递给协程并恢复它，该函数的返回值是协程返回（产生）的任意数据结构。

将所有当前标志/寄存器存储在 cs.caller_registers 中（RSP 除外），请参阅步骤 3。
将 arg 存储在 cs.arg< 中/code>
修复调用者堆栈指针 (cs.caller_registers.rsp)，如果幸运的话，添加 2*sizeof(void*) 将修复它，您可以必须查找这一点来确认它，您可能希望这个函数是 stdcall，这样在调用它之前不会篡改寄存器
mov rax, [rsp]，将 RAX 分配给cs.caller_registers.rip;解释：除非您的编译器处于破解状态，否则[RSP]将保存指向调用该函数的调用指令后面的指令的指令指针（即：返回地址）
从<加载标志和寄存器code>cs.registers
jmp cs.registers.rip，有效恢复协程的执行

请注意，我们永远不会从此函数返回，我们跳转到的协程会为我们“返回”（请参阅协程_yield）。另请注意，在该函数内部，您可能会遇到许多复杂情况，例如 C 编译器生成的函数序言和结尾，也许还有寄存器参数，您必须处理所有这些。就像我说的，stdcall 会为你省去很多麻烦，我认为 gcc 的 -fomit-frame_pointer 会删除尾声的东西。

最后一个函数声明为：

void coroutine_yield(void* ret);

该函数在协程内部调用，以“暂停”协程的执行并返回到coroutine_next的调用者>。

存储标志/寄存器在cs.registers
修复协程堆栈指针(cs.registers.rsp)，再次添加2*sizeof(void*)加载
来自 cs.caller_registers 的 flags/registers
jmp cs.caller_registers.rip 这本质上是从协程调用程序堆栈帧上的 coroutine_next 调用返回的，因为返回值在RAX中传递，我们返回了arg。假设如果 arg 为 NULL，则协程已终止，否则它是任意数据结构。

回顾一下，您可以使用 coroutine_init 初始化协程，然后可以使用 coroutine_next 重复调用实例化的协程。

协程的函数本身声明为：
void my_coro(coroutine_state cs)

cs.initarg 保存初始函数参数（想想构造函数）。每次调用 my_coro 时，cs.arg 都会有一个由 coroutine_next 指定的不同参数。这就是协程调用者与协程通信的方式。最后，每次协程想要暂停自己时，它都会调用coroutine_yield，并向其传递一个参数，该参数是协程调用程序的返回值。

好吧，您现在可能会认为“这很简单！”，但我忽略了以正确的顺序加载寄存器和标志的所有复杂性，同时仍然保持未损坏的堆栈帧并以某种方式保留协程数据结构的地址（您只需以线程安全的方式覆盖所有寄存器）。对于这一部分，您需要了解编译器内部是如何工作的...祝您好运:)

You are correct in that PUSHA wont work on x64 it will raise the exception #UD, as PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.

Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:

a:
call b
b:
pop rax

RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I'd suppose there's a nicer way to do it on x64, as it supports RIP-relative addressing, but I don't know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address :)

Since you can't push all the registers to the stack in a single instruction, I wouldn't recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.

Why are you zeroing things in function A? That's probably not necessary.

Here's how I would approach the entire thing, trying to make it as simple as possible:

Create a structure coroutine_state that holds the following:

initarg
arg
registers (also contains the flags)
caller_registers

Create a function:

coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);

where coro_func is a pointer to the coroutine function body.

This function does the following:

allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine's stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure

Now we have another function:

void* coroutine_next(coroutine_state cs, void* arg)

where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.

This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.

store all current flags/registers in cs.caller_registers except for RSP, see step 3.
store the arg in cs.arg
fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you're lucky, you'd have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, efectively resuming execution of the coroutine

Note that we never return from this function, the coroutine we jump to "returns" for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc's -fomit-frame_pointer will remove the epilogue stuff.

The last function is declared as:

void coroutine_yield(void* ret);

This function is called inside the coroutine to "pause" execution of the coroutine and return to the caller of coroutine_next.

store flags/registers in cs.registers
fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (lets just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip This essentially returns from the coroutine_next call on the coroutine invoker's stack frame, and since the return value is passed in RAX, we returned arg. Let's just say if arg is NULL, then the coroutine has terminated, otherwise it's an arbitrary data structure.

So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.

The coroutine's function itself is declared:
void my_coro(coroutine_state cs)

cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.

Okay, you may now think "thats easy!", but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally... good luck :)

回复收藏 0 原文