Operating on a vector register as a float32x4_t C variable on ARM

Posted on 2025-01-11 17:23:57

I'm using inline assembly on ARM for a scientific application.
In my assembly code, I have to (see the note at the end) explicitly name the vector registers I want to use. For example, among other inline assembly instructions, my code has asm volatile("fadd v12.4S, v12.4S, v7.4S") to do a vector floating-point add between vector registers 7 and 12, storing the result in vector register 12.

After the 'critical' assembly code part, I want to retrieve the resulting values and operate on them as Arm NEON variables in C. In my case, each vector holds four 32-bit values, so they will be of type float32x4_t.

So far I can do something like:

float32_t my_var[4];
/* Store v12 to my_var; the "memory" clobber tells the compiler that my_var is written. */
asm volatile("st1 {v12.4S}, [%[addr]]\n\t" : : [addr]"r"(my_var) : "memory");
/* From here on I can operate on my_var[0], my_var[1], etc. without writing asm code. */

I.e., I'm using a vector store instruction to write the contents of the vector register into a C array. This will cause subsequent accesses to that variable to be loads, which I want to avoid because the value is already in a register.

I'd like to have something similar to

float32x4_t my_var;
asm volatile("some code that makes sure my_var 'binds' to vector register 12");
/* From here on I could use intrinsics such as vgetq_lane_f32(my_var, 1) to get each value of the vector, again without writing asm code. */

However, I could not find a way to do the second approach. This old question had similar concerns, but it was for an older ARM ISA (I'm targeting v8), and to load from (not store to) a single (not vector) variable.

Note: I cannot use intrinsic calls from the beginning (which would make things much easier), because I'm modeling new instructions in a simulator, and I need to write low-level assembly up to that part.

Comments (1)

碍人泪离人颜 2025-01-18 17:23:57

You can use the w machine constraint to pass SIMD registers as operands to an inline assembly statement. This causes the compiler to pick a SIMD register for you.

float32x4_t add(float32x4_t a, float32x4_t b)
{
    float32x4_t c;

    asm ("fadd %0.4s, %1.4s, %2.4s" : "=w"(c) : "w"(a), "w"(b));

    return c;
}

Note that the compiler is permitted to overwrite arbitrary registers in between inline assembly statements. It is not possible to prevent this.
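One way to live with this is to keep every instruction that depends on a hard-coded register inside a single asm statement and to list those registers as clobbers. The following is only a sketch, assuming a GCC-compatible compiler targeting AArch64. The function name add_then_double and the plain C fallback for other hosts are made up for illustration.

```c
#if defined(__aarch64__)
#include <arm_neon.h>

/* Computes out = 2 * (a + b) using hard-coded v12 and v7.
 * All uses of v12/v7 live in ONE asm statement, so the compiler
 * has no opportunity to reuse them halfway through. */
static void add_then_double(const float a[4], const float b[4], float out[4])
{
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vr;

    /* __asm__ is the same as asm but is also accepted under -std=c11. */
    __asm__ volatile(
        "mov  v12.16b, %1.16b\n\t"       /* copy input a into v12 */
        "mov  v7.16b,  %2.16b\n\t"       /* copy input b into v7  */
        "fadd v12.4s, v12.4s, v7.4s\n\t"
        "fadd v12.4s, v12.4s, v12.4s\n\t"
        "mov  %0.16b, v12.16b"           /* hand the result back  */
        : "=w"(vr)
        : "w"(va), "w"(vb)
        : "v7", "v12");                  /* tell the compiler what we trashed */

    vst1q_f32(out, vr);
}
#else
/* Stand-in so the sketch also builds on non-AArch64 hosts. */
static void add_then_double(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = 2.0f * (a[i] + b[i]);
}
#endif
```

The extra mov instructions are the price of letting the compiler choose the operand registers while the body uses fixed ones.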

You can, however, tell the compiler which SIMD register to use for an operand by using local register variables. This does not guarantee that the variable resides in the indicated register at all times, but it is guaranteed to be there immediately before each inline asm statement that lists it as an input operand and immediately after each one that lists it as an output operand (see the manual for details).

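float32x4_t add(float32x4_t a_, float32x4_t b_)
{
    /* c and a may share v12: the output is written only after both inputs are read. */
    register float32x4_t c asm ("v12");
    register float32x4_t a asm ("v12") = a_;
    register float32x4_t b asm ("v4") = b_;

    /* At this statement, %0 and %1 are v12 and %2 is v4. */
    asm ("fadd %0.4s, %1.4s, %2.4s" : "=w"(c) : "w"(a), "w"(b));

    return c;
}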
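Applied to the situation in the question, where the template hard-codes v12 and v7, the same mechanism might look like the sketch below. It assumes a GCC-compatible compiler targeting AArch64. The function name lane1_of_sum and the plain C fallback for other hosts are hypothetical.

```c
#if defined(__aarch64__)
#include <arm_neon.h>

/* Returns lane 1 of (a + b); the add is done by hand-written asm
 * that names v12 and v7 directly, as in the question. */
static float lane1_of_sum(const float a[4], const float b[4])
{
    /* Evaluate the loads into temporaries first, so that nothing can
     * disturb the register variables between assignment and asm. */
    float32x4_t a_val = vld1q_f32(a);
    float32x4_t b_val = vld1q_f32(b);

    /* __asm__ is the same as asm but is also accepted under -std=c11. */
    register float32x4_t acc __asm__("v12") = a_val;
    register float32x4_t rhs __asm__("v7")  = b_val;

    /* "+w"(acc): v12 is read and written. "w"(rhs): v7 is read.
     * The operands are not referenced in the template; listing them
     * is what ties the C variables to the hard-coded registers. */
    __asm__ volatile("fadd v12.4s, v12.4s, v7.4s" : "+w"(acc) : "w"(rhs));

    /* acc is an ordinary float32x4_t from here on; no st1 is needed. */
    return vgetq_lane_f32(acc, 1);
}
#else
/* Stand-in so the sketch also builds on non-AArch64 hosts. */
static float lane1_of_sum(const float a[4], const float b[4])
{
    return a[1] + b[1];
}
#endif
```

Whether the compiler then keeps acc in v12 or moves it elsewhere is up to the compiler, which matches the guarantee described above.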

Theoretically, it should also be possible to use arithmetic to build the correct opcode at assembly time, but there does not seem to be a way to get gcc to print the register number it chose without any decoration (boo!). Supposing there were such a template modifier X, the code could look like this:

float32x4_t add3(float32x4_t a, float32x4_t b)
{
    float32x4_t c;

    /* %X is the hypothetical modifier; 0x4e20d400 encodes fadd v0.4s, v0.4s, v0.4s. */
    asm (".inst 0x4e20d400 + %X0 + (%X1<<5) + (%X2<<16)" : "=w"(c) : "w"(a), "w"(b));

    return c;
}

It might be worth patching support for such a thing into your local gcc/clang build if you need this feature.
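The opcode arithmetic itself can be checked in plain C. The sketch below computes the 32-bit encoding of fadd Vd.4s, Vn.4s, Vm.4s from three register numbers; the macro and function names are made up for illustration, and nothing here asks the compiler for the register numbers it chose.

```c
#include <stdint.h>

/* Encoding of "fadd v0.4s, v0.4s, v0.4s": Rd in bits 0-4,
 * Rn in bits 5-9, Rm in bits 16-20 are all zero. */
#define FADD_4S_BASE UINT32_C(0x4e20d400)

/* Builds the instruction word for fadd v<rd>.4s, v<rn>.4s, v<rm>.4s.
 * Register numbers are masked to the 5 bits the encoding provides. */
static uint32_t encode_fadd_4s(unsigned rd, unsigned rn, unsigned rm)
{
    return FADD_4S_BASE
         | (uint32_t)(rd & 31u)
         | ((uint32_t)(rn & 31u) << 5)
         | ((uint32_t)(rm & 31u) << 16);
}
```

For the instruction from the question, encode_fadd_4s(12, 12, 7) yields 0x4e27d58c, which is the word a .inst directive would have to emit for fadd v12.4s, v12.4s, v7.4s.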
