gcc 内联汇编中与 PC 相关的跳转

发布于 2024-10-16 07:15:18 字数 582 浏览 6 评论 0原文

我有一个 asm 循环，保证不会超过 128 次迭代，我想用 PC 相对跳转来展开它。这个想法是以相反的顺序展开每个迭代，然后跳到需要的循环中。代码看起来像这样：

#define __mul(i) \
    "movq -"#i"(%3,%5,8),%%rax;" \
    "mulq "#i"(%4,%6,8);" \
    "addq %%rax,%0;" \
    "adcq %%rdx,%1;" \
    "adcq $0,%2;"

asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out
    __mul(127)
    __mul(126)
    __mul(125)
    ...
    __mul(1)
    __mul(0)
    : "+r"(lo),"+r"(hi),"+r"(overflow)
    : "r"(a.data),"r"(b.data),"r"(i-k),"r"(k)
    : "%rax","%rdx");

Is like this possible with gcc inline assembly?

原文

I have an asm loop guaranteed not to go over 128 iterations that I want to unroll with a PC-relative jump. The idea is to unroll each iteration in reverse order and then jump however far into the loop it needs to be. The code would look like this:

#define __mul(i) \
    "movq -"#i"(%3,%5,8),%%rax;" \
    "mulq "#i"(%4,%6,8);" \
    "addq %%rax,%0;" \
    "adcq %%rdx,%1;" \
    "adcq $0,%2;"

asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out
    __mul(127)
    __mul(126)
    __mul(125)
    ...
    __mul(1)
    __mul(0)
    : "+r"(lo),"+r"(hi),"+r"(overflow)
    : "r"(a.data),"r"(b.data),"r"(i-k),"r"(k)
    : "%rax","%rdx");

Is something like this possible with gcc inline assembly?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

吹泡泡o 2024-10-23 07:15:18

在 gcc 内联汇编中，您可以使用标签并让汇编器为您整理跳转目标。就像（人为的例子）：

int max(int a, int b)
{
    int result;
    __asm__ __volatile__(
        "movl %1, %0\n"
        "cmpl %2, %0\n"
        "jeq  a_is_larger\n"
        "movl %2, %0\n"
        "a_is_larger:\n" : "=r"(result), "r"(a), "r"(b));
    return (result);
}

那是一回事。为了避免乘法，您可以做的另一件事是让汇编器为您对齐块，例如，以 32 字节的倍数对齐（我认为指令序列不适合 16 字节），就像：

#define mul(i)                     \
    ".align 32\n"                  \
    ".Lmul" #i ":\n"               \
    "movq -" #i "(%3,%5,8),%%rax\n"\
    "mulq " #i "(%4,%6,8)\n"       \
    "addq %%rax,%0\n"              \
    "adcq %%rdx,%1\n"              \
    "adcq $0,%2\n"

这将简单地用 nop 填充指令流。如果您选择不对齐这些块，您仍然可以在主表达式中使用生成的本地标签来查找汇编块的大小：

#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"
#else
__asm__ ("shlq $5, %[label]\n"
#endif
    "leaq .Lmulblkstart, %[dummy]\n"        /* this is PC-relative in 64bit */
    "jmp (%[dummy], %[label])\n"
    ".align 32\n"
    ".Lmulblkstart:\n"
    __mul(127)
    ...
    __mul(0)
    : ... [dummy]"=r"(dummy) : [label]"r"((128-count)))

对于 count 是编译的情况时间常数，您甚至可以这样做：

__asm__("jmp .Lmul" #count "\n" ...);

最后的小注释：

如果自动生成的 _mul() 东西可以创建不同长度的序列，则对齐块是一个好主意。对于您使用的常量 0..127 ，情况并非如此，因为它们都适合一个字节，但如果您将它们放大，它将变为 16 位或 32 位值和指令块将随之增长。通过填充指令流，仍然可以使用跳转表技术。

In gcc inline assembly, you can use labels and have the assembler sort out the jump target for you. Something like (contrived example):

int max(int a, int b)
{
    int result;
    __asm__ __volatile__(
        "movl %1, %0\n"
        "cmpl %2, %0\n"
        "jeq  a_is_larger\n"
        "movl %2, %0\n"
        "a_is_larger:\n" : "=r"(result), "r"(a), "r"(b));
    return (result);
}

That's one thing. The other thing you could do to avoid multiplication is to make the assembler align the blocks for you, say, at a multiple of 32 bytes (I don't think the instruction sequence fits into 16 Bytes), like:

#define mul(i)                     \
    ".align 32\n"                  \
    ".Lmul" #i ":\n"               \
    "movq -" #i "(%3,%5,8),%%rax\n"\
    "mulq " #i "(%4,%6,8)\n"       \
    "addq %%rax,%0\n"              \
    "adcq %%rdx,%1\n"              \
    "adcq $0,%2\n"

This will simply pad the instruction stream with nop. If yo do choose not to align these blocks, you can still, in your main expression, use the generated local labels to find the size of the assembly blocks:

#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"
#else
__asm__ ("shlq $5, %[label]\n"
#endif
    "leaq .Lmulblkstart, %[dummy]\n"        /* this is PC-relative in 64bit */
    "jmp (%[dummy], %[label])\n"
    ".align 32\n"
    ".Lmulblkstart:\n"
    __mul(127)
    ...
    __mul(0)
    : ... [dummy]"=r"(dummy) : [label]"r"((128-count)))

And for the case where count is a compile-time constant, you can even do:

__asm__("jmp .Lmul" #count "\n" ...);

Little note on the end:

Aligning the blocks is a good idea if the autogenerated _mul() thing can create sequences of different lengths. For constants 0..127 as you use, that won't be the case as they all fit into a byte, but if you'll scale them larger it would go to 16- or 32-bit values and the instruction block would grow alongside. By padding the instruction stream, the jumptable technique can still be used.

回复收藏 0 原文

苏辞 2024-10-23 07:15:18

这不是直接答案，但您是否考虑过使用变体
Duff 的设备而不是内联
集会？这将采用 switch 语句的形式：

switch(iterations) {
  case 128: /* code for i=128 here */
  case 127: /* code for i=127 here */
  case 126: /* code for i=126 here */
  /* ... */
  case 1:   /* code for i=1 here*/
  break;
  default: die("too many cases");
}

This isn't a direct answer, but have you considered using a variant of
Duff's Device instead of inline
assembly? That would take the form of switch statement:

switch(iterations) {
  case 128: /* code for i=128 here */
  case 127: /* code for i=127 here */
  case 126: /* code for i=126 here */
  /* ... */
  case 1:   /* code for i=1 here*/
  break;
  default: die("too many cases");
}

回复收藏 0 原文

孤云独去闲 2024-10-23 07:15:18

抱歉，我无法提供 ATT 语法的答案，希望您能轻松执行翻译。

如果您在 RCX 中有计数，并且可以在 __mul(0) 之后有一个标签，那么您可以这样做：

; rcx must be in [0..128] range.
    imul ecx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
    lea  rcx, [rcx + the_label] ; There is no memory read here
    jmp  rcx

希望这会有所帮助。

编辑：
我昨天犯了一个错误。我假设引用 [rcx + the_label] 中的标签被解析为 [rcx + rip + disp] 但事实并非如此，因为没有这样的寻址模式（仅存在 [rip + disp32]）

此代码应该可以工作，另外它将使 rcx 保持不变，并会销毁 rax 和 rdx （但您的代码似乎在先写入它们之前不会读取它们）：

; rcx must be in [0..128] range.
    imul edx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
    lea  rax, [the_label] ; PC-relative addressing (There is no memory read here)
    add  rax, rdx
    jmp  rax

Sorry I can't provide the answer in ATT syntax, I hope you can easily perform the translations.

If you have the count in RCX and you can have a label just after __mul(0) then you could do this:

; rcx must be in [0..128] range.
    imul ecx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
    lea  rcx, [rcx + the_label] ; There is no memory read here
    jmp  rcx

Hope this helps.

EDIT:
I made a mistake yesterday. I've assumed that referencing a label in [rcx + the_label] is resolved as [rcx + rip + disp] but it is not since there is no such addressing mode (only [rip + disp32] exists)

This code should work and additionally it will left rcx untouched and will destroy rax and rdx instead (but your code seems to not read them before writing to them first):

; rcx must be in [0..128] range.
    imul edx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
    lea  rax, [the_label] ; PC-relative addressing (There is no memory read here)
    add  rax, rdx
    jmp  rax

回复收藏 0 原文

~没有更多了~