How to include and translate custom instruction/extension code in standard C/C++ while keeping high performance

Posted on 2024-12-26 10:16:48

I'm developing a general-purpose image processing core for FPGAs and ASICs. The idea is to interface a standard processor with it. One of the problems I have is how to "program" it. Let me explain: the core has an instruction decoder for my "custom" extensions. For instance:

vector_addition $vector[0], $vector[1], $vector[2]    // (i.e. v2 = v0+v1)

and many more like that. The operation is sent by the processor through the bus to the core, while the processor itself handles loops, non-vector operations, etc., like this:

for (i = 0; i < 15; i++)        // to be executed in the processor
     vector_add(v0, v1, v2);    // to be executed in my custom core

The program is written in C/C++. The core only needs the instruction itself, in machine code:

opcode = vector_add = 0x12
register_src_1 = v0 = 0x00
register_src_2 = v1 = 0x01
register_dst = v2 = 0x02
machine code = opcode | v0 | v1 | v2 = 0x7606E600

(or whatever, just a concatenation of different fields to build the instruction in binary)
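
To make the "concatenation of fields" concrete, here is a minimal sketch in C. The layout is an assumption made only for illustration (four 8-bit fields, opcode in the top byte), so it does not reproduce the example word above; `CORE_INSN`, `OP_VECTOR_ADD` and `core_insn` are hypothetical names. With constant arguments the macro folds to a single literal at compile time, so nothing is translated at runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: four 8-bit fields, opcode in the top byte.
   The real core's field widths and order may differ. */
enum { OP_VECTOR_ADD = 0x12 };

#define CORE_INSN(op, src1, src2, dst)      \
    (((uint32_t)(op)   << 24) |             \
     ((uint32_t)(src1) << 16) |             \
     ((uint32_t)(src2) <<  8) |             \
      (uint32_t)(dst))

/* Constant arguments: the compiler reduces this to one 32-bit literal. */
static const uint32_t VADD_V0_V1_V2 = CORE_INSN(OP_VECTOR_ADD, 0x00, 0x01, 0x02);

/* Runtime variant for register numbers that are only known while running. */
static inline uint32_t core_insn(unsigned op, unsigned src1,
                                 unsigned src2, unsigned dst)
{
    /* Every field must fit in its 8-bit slot. */
    assert(op <= 0xFF && src1 <= 0xFF && src2 <= 0xFF && dst <= 0xFF);
    return CORE_INSN(op, src1, src2, dst);
}
```

Under this assumed layout, `vector_add v0, v1, v2` becomes the word `0x12000102`, which can then be written to the core over the bus like any other value.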

Once the instruction has been sent through the bus, the core can request all data from memory over dedicated buses and handle everything without using the processor. The big question is: how can I translate the previous instruction to its hexadecimal representation? (Sending it through the bus is not a problem.) Some options that come to mind are:

Run interpreted code (translate to machine code at runtime in the processor) --> very slow, even using some kind of inline macro
Compile the custom sections with an external custom compiler, load the binary from external memory and move it to the core with some unique instruction --> hard to read/understand source code, poor SDK integration, too many sections if the code is very segmented
JIT compilation --> too complex just for this?
Extending the compiler --> a nightmare!
A custom processor connected to the custom core to handle everything: loops, pointers, memory allocation, variables... --> too much work

The problem is about software/compilers, but for those who have deep knowledge of this topic: this is a SoC in an FPGA, the main processor is a MicroBlaze, and the IP core uses AXI4 buses.

I hope I explained it correctly... Thanks in advance!

倾城泪 2025-01-02 10:16:48

I'm not sure I entirely understand, but I think I've been faced with something similar before. Based on the comment to rodrigo's response it sounds like you have small instruction pieces scattered through your code. You also mention an external compiler is possible, just a pain. If you combine the external compiler with a C macro you can get something decent.

Consider this code:

for (i = 0; i < 15; i++)
     CORE_EXEC(vector_add(v0, v1, v2), ref1);

The CORE_EXEC macro will serve two purposes:

You can use an external tool to scan your source files for these entries and compile the core code. This code will be linked to C (just produce a C file with binary bits) using the "ref1" name as a variable.
In C you'll define the CORE_EXEC macro to pass the "ref1" string to the core for processing.

So stage 1 will produce a file of compiled binary core instructions, for example the above might have a string like this:

const unsigned char cx_ref1[] = { 0x12, 0x00, 0x01, 0x02 };

And you might define CORE_EXEC like this:

#define CORE_EXEC( code, name ) send_core_exec( cx_##name )

Obviously you can choose the prefixes however you want, though in C++ you might wish to use a namespace instead.

In terms of toolchain you could produce one file for all your bits or produce one file per C++ file -- which might make dirty detection easier. Then you can simply include the generated files in your source code.
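
Putting the pieces of this answer together, a minimal sketch might look like the following. It rests on several assumptions: the external tool is imagined to have already emitted `cx_ref1`, the bytes are the illustrative ones from above, and `send_core_exec` here is a stand-in that only records what it was given instead of driving the real bus. Unlike the one-argument macro above, this variant also passes the length so the receiver knows how many bytes to send.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* --- would live in a generated file, e.g. core_code.h (hypothetical name) --- */
static const uint8_t cx_ref1[] = { 0x12, 0x00, 0x01, 0x02 };

/* --- hand-written side --- */
static uint8_t last_sent[16];   /* stand-in for the bus: keeps the last transfer */
static size_t  last_len;
static int     send_count;

static void send_core_exec(const uint8_t *code, size_t len)
{
    assert(len <= sizeof last_sent);
    for (size_t i = 0; i < len; i++)
        last_sent[i] = code[i];
    last_len = len;
    send_count++;
}

/* The first argument is only read by the external tool; the preprocessor
   discards it, so the C compiler never sees vector_add(v0, v1, v2). */
#define CORE_EXEC(code, name) send_core_exec(cx_##name, sizeof cx_##name)

static void run_loop(void)
{
    for (int i = 0; i < 15; i++)
        CORE_EXEC(vector_add(v0, v1, v2), ref1);
}
```

Because the first macro argument is never expanded, the core-side syntax can be anything the external tool understands, as long as its parentheses are balanced.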


只等公子 2025-01-02 10:16:48

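Couldn't you translate all your code sections to machine code once, at the beginning of the program, keep them in binary form in a memory block, and then use those binaries whenever they are needed?

That is basically how OpenGL shaders work, and I find it quite easy to manage.

The main drawback is memory consumption, because both the text and the binary representation of the same script are in memory at the same time. I don't know whether that is a problem for you. If it is, there are partial solutions, such as unloading the source text after compiling.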
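
A minimal sketch of this "translate once, reuse the binary" idea, in the spirit of compiling a shader and keeping only a handle to it. Everything here is assumed for illustration: the textual syntax (`vector_add 0 1 2`), the 8-bit field layout, and the names `core_compile` and `core_lookup`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SNIPPETS 8

static uint32_t snippet_code[MAX_SNIPPETS]; /* binary form, kept for the whole run */
static int snippet_count;

/* Translates one line of a made-up textual syntax, e.g. "vector_add 0 1 2".
   Returns a handle (table index) on success, -1 on error.
   Only vector_add is recognised in this sketch. */
static int core_compile(const char *text)
{
    unsigned s1, s2, d;

    if (snippet_count >= MAX_SNIPPETS)
        return -1;
    if (sscanf(text, "vector_add %u %u %u", &s1, &s2, &d) != 3)
        return -1;
    if (s1 > 0xFF || s2 > 0xFF || d > 0xFF)
        return -1;

    snippet_code[snippet_count] = (0x12u << 24) | (s1 << 16) | (s2 << 8) | d;
    return snippet_count++;
}

/* Hot path: no parsing, just a table lookup. */
static uint32_t core_lookup(int handle)
{
    assert(handle >= 0 && handle < snippet_count);
    return snippet_code[handle];
}
```

The parsing cost is paid once per snippet at startup; inside the loop only `core_lookup` runs, and the source strings can be discarded after `core_compile` returns.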


心在旅行 2025-01-02 10:16:48

Let's say I was going to modify an ARM core to add some custom instructions, and the operations I wanted to run were known at compile time (I will get to runtime in a sec).

I would use assembly, for example:

.globl vecabc
vecabc:
   .word 0x7606E600 ;@ special instruction
   bx lr

or inline it with whatever the inline syntax is for your compiler. That gets harder if you need to use processor registers, for example where the C compiler fills in the registers in the inline assembly language and the assembler then assembles those instructions. I prefer writing actual asm and just injecting the words into the instruction stream as above. Only the compiler distinguishes some bytes as data and some bytes as instructions; the core will see them in order as written.

If you need to do things at run time you can use self-modifying code; again, I like to use asm for the trampoline. Build the instructions you want to run somewhere in RAM, say at address 0x20000000, then have a trampoline call it:

.globl tramp
tramp:
    bx r0 ;@ assuming you encoded a return in your instructions

call it with

tramp(0x20000000);
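
The "build the instructions somewhere in RAM" step could be sketched in C as below. This only fills a buffer and does not jump to it, so it can be tried on any host. `build_stub` is a hypothetical helper; `0xE12FFF1E` is the A32 encoding of `bx lr`, which supplies the return that the trampoline comment assumes. On a real target the buffer would sit at an executable address such as 0x20000000, and on cores with separate instruction and data caches it would need cache maintenance before the trampoline is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM_BX_LR 0xE12FFF1Eu   /* A32 encoding of "bx lr" */

/* Copies n custom instruction words into buf and appends a return.
   Returns the number of words written, or 0 if they do not fit. */
static size_t build_stub(uint32_t *buf, size_t cap,
                         const uint32_t *insns, size_t n)
{
    assert(buf != NULL && insns != NULL);
    if (n + 1 > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = insns[i];
    buf[n] = ARM_BX_LR;         /* so that "bx r0" into buf comes back */
    return n + 1;
}
```

On the target, the address of the filled buffer is what would be passed to `tramp`.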

Another path, related to the one above, is to modify the assembler to add the new instructions and create a syntax for them. Then you can use straight assembly language or inline assembly language at will. You won't get the compiler to use them without modifying the compiler, which is another path to take after the assembler has been modified.
