PC-relative jump in gcc inline assembly
I have an asm loop guaranteed not to go over 128 iterations that I want to unroll with a PC-relative jump. The idea is to unroll each iteration in reverse order and then jump however far into the loop it needs to be. The code would look like this:
#define __mul(i) \
"movq -"#i"(%3,%5,8),%%rax;" \
"mulq "#i"(%4,%6,8);" \
"addq %%rax,%0;" \
"adcq %%rdx,%1;" \
"adcq $0,%2;"
asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out
__mul(127)
__mul(126)
__mul(125)
...
__mul(1)
__mul(0)
: "+r"(lo),"+r"(hi),"+r"(overflow)
: "r"(a.data),"r"(b.data),"r"(i-k),"r"(k)
: "%rax","%rdx");
Is something like this possible with gcc inline assembly?
In gcc inline assembly, you can use labels and have the assembler sort out the jump target for you. Something like (contrived example):
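A minimal sketch of the label idea, not the answerer's original code. It assumes GCC (or a compatible compiler) on x86-64; the function name `add_unless_skipped` is made up for illustration. The numeric local label `1:` is referenced as `1f` ("next label 1 forward"), and the assembler fills in the relative displacement of the jump:

```c
#include <stdint.h>

/* Contrived: returns x + 100, or x unchanged when skip is non-zero. */
static uint64_t add_unless_skipped(uint64_t x, uint64_t skip)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(
        "testq %[skip], %[skip]\n\t"
        "jnz   1f\n\t"            /* assembler computes the displacement to 1: */
        "addq  $100, %[x]\n"
        "1:\n\t"                  /* numeric local label, safe if the asm is duplicated */
        : [x] "+r"(x)
        : [skip] "r"(skip)
        : "cc");
#else
    if (!skip)                    /* portable fallback for other targets */
        x += 100;
#endif
    return x;
}
```

Numeric labels are used instead of named ones so that the statement still assembles if the compiler inlines or duplicates it.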
That's one thing. The other thing you could do, to avoid the multiplication, is to make the assembler align the blocks for you, say at a multiple of 32 bytes (I don't think the instruction sequence fits into 16 bytes), like:
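A hedged sketch of the aligned-block approach, assuming GCC on x86-64. It is scaled down to four blocks, and a single `incq` stands in for one `__mul(i)` body; `ALIGNED_BLOCK` and `run_last_blocks` are hypothetical names. Because every block starts on a 32-byte boundary, the byte offset to skip is a shift rather than a multiply:

```c
#include <stdint.h>

/* One stand-in block, padded by the assembler up to the next 32-byte boundary. */
#define ALIGNED_BLOCK "incq %[acc]\n\t.balign 32\n\t"

/* Runs the last `count` of 4 blocks; count must be in 0..4. */
static uint64_t run_last_blocks(uint64_t count)
{
    uint64_t acc = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t target;
    uint64_t skip_bytes = (4 - count) << 5;   /* blocks to skip * 32 */
    __asm__ volatile(
        "leaq 1f(%%rip), %[t]\n\t"            /* address of the first block */
        "addq %[off], %[t]\n\t"
        "jmp  *%[t]\n\t"                      /* indirect jump into the chain */
        ".balign 32\n"
        "1:\n\t"
        ALIGNED_BLOCK ALIGNED_BLOCK ALIGNED_BLOCK ALIGNED_BLOCK
        : [acc] "+r"(acc), [t] "=&r"(target)
        : [off] "r"(skip_bytes)
        : "cc");
#else
    while (count--)                           /* portable fallback */
        acc++;
#endif
    return acc;
}
```

The trailing `.balign 32` after the last block matters: it makes the end of the chain a valid jump target for `count == 0`.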
This will simply pad the instruction stream with nop. If you choose not to align these blocks, you can still, in your main expression, use the generated local labels to find the size of the assembly blocks:
And for the case where count is a compile-time constant, you can even do:
A little note at the end:
Aligning the blocks is a good idea if the autogenerated _mul() thing can create sequences of different lengths. For the constants 0..127 that you use, that won't be the case, as they all fit into a byte, but if you scale them larger they will need 32-bit displacements and the instruction block will grow accordingly. By padding the instruction stream, the jump-table technique can still be used.
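The label-difference idea mentioned in this answer (measuring one block with local labels instead of padding it) could look roughly like the following. This is a sketch under assumptions, not the original code: GCC with the GNU assembler on x86-64, four stand-in blocks that all encode to the same length, and a hypothetical function name. The expression `2f - 1f` is resolved by the assembler to the size of the first block:

```c
#include <stdint.h>

/* Runs the last `count` of 4 unpadded blocks; count must be in 0..4. */
static uint64_t run_last_blocks_unaligned(uint64_t count)
{
    uint64_t acc = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t target;
    uint64_t skip = 4 - count;              /* number of blocks to jump over */
    __asm__ volatile(
        "imulq $(2f - 1f), %[skip]\n\t"     /* skip *= size of one block in bytes */
        "leaq  1f(%%rip), %[t]\n\t"         /* address of the first block */
        "addq  %[skip], %[t]\n\t"
        "jmp   *%[t]\n"
        "1:\n\t"
        "incq  %[acc]\n"                    /* block 1 (measured by 2f - 1f) */
        "2:\n\t"
        "incq  %[acc]\n\t"                  /* blocks 2..4, same encoding length */
        "incq  %[acc]\n\t"
        "incq  %[acc]\n\t"
        : [acc] "+r"(acc), [skip] "+r"(skip), [t] "=&r"(target)
        :
        : "cc");
#else
    while (count--)                         /* portable fallback */
        acc++;
#endif
    return acc;
}
```

This only works while every block has the same encoded length, which is exactly the condition the note above warns about.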
This isn't a direct answer, but have you considered using a variant of Duff's Device instead of inline assembly? That would take the form of a switch statement:
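A small sketch of what such a switch could look like, scaled down to four cases. The names `MUL_STEP` and `dot_small` are hypothetical, and the step is simplified to a plain 64-bit multiply-accumulate with forward indexing on both arrays, rather than the question's three-word accumulator:

```c
#include <stddef.h>
#include <stdint.h>

/* One unrolled "iteration": multiply-accumulate the pair at index i. */
#define MUL_STEP(i) acc += a[i] * b[i]

/* Sums a[i] * b[i] for i in 0..count-1; count must be at most 4. */
static uint64_t dot_small(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t acc = 0;
    switch (count) {            /* enter the unrolled chain at the right depth */
    case 4: MUL_STEP(3);        /* fall through */
    case 3: MUL_STEP(2);        /* fall through */
    case 2: MUL_STEP(1);        /* fall through */
    case 1: MUL_STEP(0);        /* fall through */
    case 0: break;
    }
    return acc;
}
```

For a dense range of case values like this, compilers typically emit a jump table or an equivalent computed branch, which is the effect the question is after, without hand-computing block sizes.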
Sorry, I can't provide the answer in AT&T syntax; I hope you can easily perform the translation.
If you have the count in RCX and you can have a label just after __mul(0), then you could do this:
Hope this helps.
EDIT:
I made a mistake yesterday. I assumed that a label referenced in [rcx + the_label] is resolved as [rcx + rip + disp], but it is not, since there is no such addressing mode (only [rip + disp32] exists).
This code should work; additionally, it will leave rcx untouched and will destroy rax and rdx instead (but your code does not seem to read them before writing to them first):
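An AT&T-syntax sketch of the approach described in this answer, written as GCC inline assembly. It is an illustration under assumptions rather than the answerer's code: x86-64, four stand-in blocks padded to 32 bytes each (so the block size is known), and hypothetical names `END_BLOCK` and `run_from_end`. The target is computed backwards from a label placed just after the last block, using only `rax` and `rdx` as scratch and leaving the count register untouched:

```c
#include <stdint.h>

/* One stand-in for a padded __mul(i) block, 32 bytes after alignment. */
#define END_BLOCK "incq %[acc]\n\t.balign 32\n\t"

/* Runs the last `count` of 4 blocks; count must be in 0..4. */
static uint64_t run_from_end(uint64_t count)
{
    uint64_t acc = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile(
        "leaq 2f(%%rip), %%rax\n\t"   /* rax = address just past the last block */
        "movq %[n], %%rdx\n\t"
        "shlq $5, %%rdx\n\t"          /* rdx = count * 32 bytes */
        "subq %%rdx, %%rax\n\t"       /* step back over `count` blocks */
        "jmp  *%%rax\n\t"
        ".balign 32\n\t"
        END_BLOCK END_BLOCK END_BLOCK END_BLOCK
        "2:\n\t"
        : [acc] "+r"(acc)
        : [n] "r"(count)              /* read only, never modified */
        : "rax", "rdx", "cc");
#else
    while (count--)                   /* portable fallback */
        acc++;
#endif
    return acc;
}
```

Listing `rax` and `rdx` as clobbers keeps the compiler from placing `acc` or `count` in them.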