在 C 虚拟机中实现寄存器
我用 C 语言编写了一个虚拟机作为业余爱好项目。该虚拟机执行的代码与 Intel 语法 x86 汇编非常相似。问题是这个虚拟机使用的寄存器只是名义上的寄存器。在我的 VM 代码中,寄存器的使用方式与 x86 寄存器一样,但机器将它们存储在系统内存中。在 VM 代码中使用寄存器而不是系统内存并没有性能改进。 (我认为仅局部性会在一定程度上提高性能,但实际上,没有任何改变。)
在解释程序时,该虚拟机将指令的参数存储为指针。这允许虚拟指令采用内存地址、常量值、虚拟寄存器或任何东西作为参数。
由于硬件寄存器没有地址,我想不出一种方法来将虚拟机寄存器实际存储在硬件寄存器中。在我的虚拟寄存器类型上使用 register 关键字不起作用,因为我必须获取指向虚拟寄存器的指针才能将其用作参数。有没有什么方法可以让这些虚拟寄存器的性能更像它们的本地寄存器?
如有必要,我非常乐意钻研组装。我知道 JIT 编译此 VM 代码可以让我利用硬件寄存器,但我也希望能够将它们与我的解释代码一起使用。
I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
机器寄存器没有索引支持:在没有代码生成的情况下,您无法使用运行时指定的“索引”访问寄存器,无论这意味着什么。由于您可能会根据指令解码寄存器索引,因此唯一的方法是进行一个巨大的切换(即
switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; .. }
)。这可能是一个坏主意,因为它过多地增加了解释器循环大小,因此会引入指令缓存抖动。如果我们谈论 x86,另一个问题是通用寄存器的数量非常少;其中一些将用于簿记(存储 PC、存储 VM 堆栈状态、解码指令等) - 您不太可能为 VM 有多个可用寄存器。
即使寄存器索引支持可用,它也不太可能给您带来很大的性能。通常解释器中最大的瓶颈是指令解码; x86 支持基于寄存器值的快速且紧凑的内存寻址(即 mov eax, dword ptr [ebx * 4 + ecx]),因此您不会赢得太多。检查生成的程序集是值得的 - 即确保“寄存器池”地址存储在寄存器中。
加速解释器的最好方法是 JITting;即使是简单的 JIT(即没有智能寄存器分配 - 基本上只是发出与指令循环和 switch 语句执行的代码相同的代码,指令解码除外)也可以将性能提高 3 倍或更多(这些是简单 JITter 的实际结果)在类似 Lua 的基于寄存器的 VM 之上)。解释器最好保留为参考代码(或者用于冷代码以减少 JIT 内存成本 - JIT 生成成本对于简单的 JIT 来说不是问题)。
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e.
switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }
). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e.
mov eax, dword ptr [ebx * 4 + ecx]
), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
即使您可以直接访问硬件寄存器,围绕使用寄存器而不是内存的决定包装代码也会慢得多。
为了获得性能,您需要预先针对性能进行设计。
举几个例子。
通过设置所有陷阱以捕获离开其虚拟内存空间的代码来准备 x86 VM。直接执行代码,不要模拟,分支到它并运行。当代码超出其内存/I/O 空间以与设备等通信时,捕获该设备并模拟该设备或它所到达的任何设备,然后将控制权返回给程序。如果代码受处理器限制,它将运行得非常快,如果受 I/O 限制,则运行速度会很慢,但不会像模拟每条指令那么慢。
静态二进制翻译。在运行之前反汇编和翻译代码,例如指令 0x34,0x2E 将在 .c 文件中转换为 ascii:
al ^= 0x2E;
of =0;
cf=0;
sf=al
理想情况下执行大量死代码删除(如果下一条指令也修改标志,则不要在此处修改它们,等等) 。让编译器中的优化器完成剩下的工作。通过这种方式,您可以通过模拟器获得性能增益,性能增益的好坏取决于您优化代码的程度。作为一个新程序,它在硬件上运行,注册内存等等,因此处理器绑定的代码比虚拟机慢,在某些情况下,您不必处理处理器执行异常来捕获内存/io,因为您已经模拟了内存在代码中访问,但这仍然有成本并且无论如何都会调用模拟设备,因此没有节省。
动态翻译,类似于 sbt 但你在运行时执行此操作,我听说过这样做,例如在其他处理器上模拟 x86 代码时,例如 dec alpha,代码会慢慢从 x86 指令更改为本机 alpha 指令,因此下一次它直接执行 alpha 指令,而不是模拟 x86 指令。每次执行代码时,程序都会执行得更快。
或者,也许只是重新设计您的模拟器,从执行的角度来看更加高效。以 MAME 中的模拟处理器为例,为了性能而牺牲了代码的可读性和可维护性。当编写时这一点很重要,今天有了多核千兆赫处理器,您不必那么努力地模拟 1.5GHz 6502 或 3GHz z80。像在表中查找下一个操作码并决定不模拟指令的部分或全部标志计算这样简单的事情可以给您带来显着的提升。
最重要的是,如果您有兴趣在运行程序时使用 x86 硬件寄存器 Ax、BX 等来模拟 AX、BX 等寄存器,唯一有效的方法是实际执行指令,而不是执行和执行与单步调试器一样捕获,但执行长指令字符串,同时防止它们离开 VM 空间。有不同的方法可以做到这一点,并且性能结果会有所不同,但这并不意味着它会比性能高效的模拟器更快。这限制了您将处理器与程序相匹配。使用高效的代码和真正好的编译器(好的优化器)模拟寄存器将为您提供合理的性能和可移植性,因为您不必将硬件与正在运行的程序相匹配。
Even if you could directly access hardware registers, wrapping code around the decision to use a register instead of memory is that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps to catch the code leaving its virtual memory space. Execute the code directly, dont emulate, branch to it and run. When the code reaches out of its memory/i/o space to talk to a device, etc, trap that and emulate that device or whatever it was reaching for then return control back to the program. If the code is processor bound it will run really fast, if I/O bound then slow but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running, for example an instruction 0x34,0x2E would turn into ascii in a .c file:
al ^= 0x2E;
of =0;
cf=0;
sf=al
Ideally performing tons of dead code removal (if the next instruction modifies the flags as well then dont modify them here, etc). And letting the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator, how good of a performance gain depends on how well you can optimize the code. Being a new program it runs on the hardware, registers memory and all, so the processor bound code is slower than a VM, in some cases you dont have to deal with the processor doing exceptions to trap memory/io because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway so no savings there.
Dynamic translation, similar to sbt but you do this at runtime, I have heard this done for example when simulating x86 code on some other processor say a dec alpha, the code is slowly changed into native alpha instructions from x86 instructions so the next time around it executes the alpha instruction directly instead of emulating the x86 instruction. Each time through the code the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME for example, the readability and maintainability of the code has been sacrificed for performance. When written that was important, today with multi-core gigahertz processors you dont have to work so hard to emulate a 1.5ghz 6502 or 3ghz z80. Something as simple as looking the next opcode up in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
Bottom line, if you are interested in using the x86 hardware registers, Ax, BX, etc to emulate AX, BX, etc registers when running a program, the only efficient way to do that is to actually execute the instruction, and not execute and trap as in single stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, and performance results will vary, and that doesnt mean it will be faster than a performance efficient emulator. This limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability in that you dont have to match the hardware to the program being run.
在执行之前(提前)转换复杂的、基于寄存器的代码。一个简单的解决方案是用于执行的双栈虚拟机,它提供了将栈顶元素(TOS)缓存在寄存器中的可能性。如果您更喜欢基于寄存器的解决方案,请选择“操作码”格式,该格式捆绑尽可能多的指令(经验法则,如果选择 MISC 样式设计,则最多可以将四个指令捆绑到一个字节中)。这样,虚拟寄存器访问就可以本地解析为每个静态超级指令的物理寄存器引用(clang 和 gcc 能够执行此类优化)。作为副作用,无论具体的寄存器分配如何,降低的 BTB 误预测率都会带来更好的性能。
基于 C 的解释器的最佳线程技术是直接线程(标签即地址扩展)和复制切换线程(符合 ANSI)。
transform your complex, register-based code before execution (ahead of time). A simple solution would be a forth like dual-stack vm for execution which offering the possibility to cache the top-of-stack element (TOS) in a register. If you prefer a register-based solution choose an "opcode" format which bundles as much as possible instructions (thumb rule, up to four instructions can be bundled into a byte if a MISC style design is chosen). This way virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc able to perform such optimization). As side effect the lowered BTB mis-prediction rate would result in far better performance regardless of specific register allocations.
Best threading techniques for C based interpreters are direct threading (label-as-address extension) and replicated switch-threading (ANSI conform).
因此,您正在编写一个 x86 解释器,它必然比实际硬件慢 1 到 10 的 3 次方。在真实的硬件中,说
mov mem, foo
会比mov reg, foo
花费更多的时间,而在程序中mem[adr] = foo
大约需要myRegVars[regnum] = foo
(模缓存)。那么您期望相同的速度差吗?如果你想模拟寄存器和内存之间的速度差异,你将不得不做一些类似于 Cachegrind 的事情。也就是说,保留一个模拟时钟,当它进行内存引用时,它会添加一个很大的数字。
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower that the actual hardware. In the real hardware, saying
mov mem, foo
is going take a lot more time thanmov reg, foo
, while in your programmem[adr] = foo
is going to take about as long asmyRegVars[regnum] = foo
(modulo cacheing). So you're expecting the same speed differential?If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does. That is, keep a simulated clock, and when it does a memory reference, it adds a big number to that.
您的虚拟机似乎太复杂,无法进行有效的解释。一个明显的优化是拥有一个“微代码”VM,带有寄存器加载/存储指令,甚至可能是基于堆栈的指令。您可以在执行之前将高级虚拟机转换为更简单的虚拟机。另一个有用的优化取决于 gcc 可计算标签扩展,请参阅 Objective Caml VM 解释器以获取此类线程 VM 实现的示例。
Your VM seems to be too complicated for an efficient interpretation. An obvious optimisation is to have a "microcode" VM, with register load/store instructions, probably even a stack-based one. You can translate your high level VM into a simpler one before the execution. Another useful optimisation depends on a gcc computable labels extension, see the Objective Caml VM interpreter for the example of such a threaded VM implementation.
要回答您提出的具体问题:
您可以指示您的 C 编译器留下一堆寄存器供您使用。指向内存第一页的指针通常是不允许的,它们被保留用于 NULL 指针检查,因此您可能会滥用初始指针来标记寄存器。如果您有一些空闲的本机寄存器,这会有所帮助,因此我的示例使用 64 位模式来模拟 4 个寄存器。交换机的额外开销很可能会减慢执行速度,而不是使其更快。另请参阅其他答案以获取一般建议。
To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers to the first page of memory are usually not allowed, they are reserved for NULL pointer checks, so you could abuse the initial pointers for marking registers. It helps if you have a few native registers to spare, so my example uses 64 bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of making it faster. Also see the other answers for general advices.