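Ineffective inline assembler in Delphi using SSE2
I have a simple floating-point operation that is always executed twice, so I tried to translate it to SSE, but it just fails. The high-level language is Delphi, and since it doesn't support intrinsic functions I have to write the whole thing in assembler. Basically it is just parameter load/store plus a few multiplications and additions: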
Procedure TLP1Poly2.Process(Const _a1, _b1, _OldIn1, _OldIn2, _OldOut1, _OldOut2: Double; Var Sample1, Sample2: Double);
Asm
MOVLPD XMM4, _a1
MOVHPD XMM4, _a1
MOVLPD XMM3, _b1
MOVHPD XMM3, _b1
//
MOVLPD XMM0, [Sample1]
MOVHPD XMM0, [Sample2]
MULPD XMM0, XMM4
//
MOVLPD XMM1, _OldIn1
MOVHPD XMM1, _OldIn2
MULPD XMM1, XMM4
//
MOVLPD XMM2, _OldOut1
MOVHPD XMM2, _OldOut2
MULPD XMM2, XMM3
//
ADDPD XMM0, XMM1
ADDPD XMM0, XMM2
//
MOVLPD [Sample1], XMM0
MOVHPD [Sample2], XMM0
//
// which stands for twice this:
// Sample:= Sample*a1 + oldinp*a1 + oldout*b1;
//
End;
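But this procedure doesn't work. If I 'nop' everything between the loading and the saving of Sample1/Sample2 it is fine, but otherwise my filter is silent. What basic thing am I missing about SSE here?
Addendum:
Old class: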
constructor TLP1.create;
begin
oldfreq := -1 ;
end;
procedure TLp1.process(inp,Frq,SR :single);
begin
if Frq<>oldfreq then
begin
a := 2* SR;
t := Frq * _ppi;
n := 1/ (a+t) ;
b1:= (a - t) * n;
a1:= t * n;
oldfreq := frq;
end;
outlp := (inp+_kd)*a1 + oldinp*a1 + oldout*b1;
oldout := outlp ;
oldinp := inp;
end;
New class:
Procedure TLP2Poly2.SetSamplerate(Const Value: Single);
Begin
If Value = FSamplerate Then Exit;
FSamplerate := Value;
UpdateCoefficients;
End;
Procedure TLP2Poly2.SetFrequency(Const Value: Single);
Begin
If Value = FFrequency Then Exit;
FFrequency := Value;
UpdateCoefficients;
End;
Procedure TLP2Poly2.UpdateCoefficients;
Var
a,t,n: Single;
Begin
a := 2 * FSamplerate ;
t := FFrequency * 2 * pi;
n := 1/ (a+t) ;
b1:= (a - t) * n;
a1:= t * n;
End;
Procedure TLP2Poly2.Process(Var Sample1, Sample2: Double);
Var
o1, o2: Double;
Begin
o1 := Sample1;
o2 := Sample2;
IntProcess( a1, b1, OldIn1, OldIn2, OldOut1, OldOut2, Sample1, Sample2);
OldOut1 := Sample1;
OldOut2 := Sample2;
OldIn1 := o1;
OldIn2 := o2;
End;
Procedure TLP2Poly2.IntProcess(Const _a1, _b1, _OldIn1, _OldIn2, _OldOut1, _OldOut2: Double; Var Sample1, Sample2: Double);
Asm
MOVLPD XMM4, _a1
MOVHPD XMM4, _a1
MOVLPD XMM3, _b1
MOVHPD XMM3, _b1
//
MOVLPD XMM0, [Sample1]
MOVHPD XMM0, [Sample2]
MULPD XMM0, XMM4
//
MOVLPD XMM1, _OldIn1
MOVHPD XMM1, _OldIn2
MULPD XMM1, XMM4
//
MOVLPD XMM2, _OldOut1
MOVHPD XMM2, _OldOut2
MULPD XMM2, XMM3
//
ADDPD XMM0, XMM1
ADDPD XMM0, XMM2
//
MOVLPD [Sample1], XMM0
MOVHPD [Sample2], XMM0
End;
Comments (2)
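When writing assembler for Delphi, especially in 64-bit mode, you should always be aware of how parameters are passed. I never use the names of the first four parameters, as these are in registers anyway; I use those registers directly.
Note that _a1, _b1, _OldIn1 and _OldIn2 are passed in XMM0 through XMM3 respectively, so the first part of your code overwrites some of these registers. For instance, loading XMM3 with _b1 overwrites _OldIn2. The same happens with XMM2, which holds _OldIn1. It would make sense to rearrange your register usage so that you don't have to use memory as an intermediate.
IOW, try something like (untested):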
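The answerer's own listing is not reproduced here. As a separate illustration, one way to sidestep the clobbering problem is to pass pointers instead of Double values, so that nothing of interest arrives in an XMM register at all. The following is a minimal, untested sketch under several assumptions: a Win64 target, a standalone routine rather than a method (for a method the hidden Self pointer takes the first parameter slot, which shifts the register assignment), and a hypothetical record TLP1State with each coefficient stored twice. None of these names come from the original post.

```pascal
type
  TPair = array[0..1] of Double;   // two channels side by side

  // Hypothetical layout, not from the original post.
  // Offsets: A1 = 0, B1 = 16, OldIn = 32, OldOut = 48 (Double fields only).
  TLP1State = record
    A1, B1: TPair;         // each coefficient duplicated: (a1, a1), (b1, b1)
    OldIn, OldOut: TPair;  // previous input and output per channel
  end;

// Computes, for both channels at once:
//   Sample := Sample*a1 + OldIn*a1 + OldOut*b1
// and then updates OldIn and OldOut.
// Assumed Win64 convention: State in RCX, Samples in RDX.
// (Under the Win32 register convention these would be EAX and EDX.)
procedure LP1Step(var State: TLP1State; var Samples: TPair);
asm
  MOVUPD XMM0, [RDX]        // (Sample1, Sample2)
  MOVUPD XMM1, [RCX + 32]   // (OldIn1, OldIn2)
  MOVUPD XMM2, [RCX + 48]   // (OldOut1, OldOut2)
  MOVUPD XMM3, [RCX]        // (a1, a1)
  MOVUPD XMM4, [RCX + 16]   // (b1, b1)
  MOVUPD [RCX + 32], XMM0   // OldIn := current input, before it is scaled
  MULPD  XMM0, XMM3         // Sample * a1
  MULPD  XMM1, XMM3         // OldIn  * a1
  MULPD  XMM2, XMM4         // OldOut * b1
  ADDPD  XMM0, XMM1
  ADDPD  XMM0, XMM2
  MOVUPD [RDX], XMM0        // write both results back
  MOVUPD [RCX + 48], XMM0   // OldOut := new output
end;
```

Only XMM0 through XMM4 are touched, which are volatile under the Win64 convention, and MOVUPD is used throughout so the record does not need 16-byte alignment. A caller would copy the two samples into a TPair, call LP1Step(State, Samples), and read the results back from the same TPair.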
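In Delphi there is a debugger pane ("FPU") that shows the SSE registers. So if you feed your filter some non-zero values, you should be able to step through the code and find where the silent output comes from.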
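Reading the register view is easier when you know what each intermediate value should be. Below is a minimal scalar reference in plain Delphi for one channel of the step from the question. It prints the three products separately, so they can be compared with the matching half of XMM0, XMM1 and XMM2 after each MULPD. The name RefStep and the test values are made up for illustration.

```pascal
program LP1Check;
{$APPTYPE CONSOLE}

// Scalar reference for one channel of the filter step in the question:
//   Sample := Sample*a1 + OldIn*a1 + OldOut*b1
procedure RefStep(const a1, b1, OldIn, OldOut: Double; var Sample: Double);
var
  t0, t1, t2: Double;
begin
  t0 := Sample * a1;   // expected in XMM0 after the first MULPD
  t1 := OldIn * a1;    // expected in XMM1 after the second MULPD
  t2 := OldOut * b1;   // expected in XMM2 after the third MULPD
  Writeln('Sample*a1 = ', t0:0:6, '  OldIn*a1 = ', t1:0:6,
          '  OldOut*b1 = ', t2:0:6);
  Sample := t0 + t1 + t2;
end;

var
  S1, S2: Double;
begin
  // Hypothetical non-zero inputs, chosen only so that no term vanishes.
  S1 := 1.0;
  S2 := -0.5;
  RefStep(0.25, 0.5, 0.5, 0.25, S1);   // channel 1 (low half of each register)
  RefStep(0.25, 0.5, -1.0, 2.0, S2);   // channel 2 (high half of each register)
  Writeln('Sample1 = ', S1:0:6, '  Sample2 = ', S2:0:6);
end.
```

Feeding the same values to the assembler routine and single-stepping with the FPU pane open should show the first register whose contents differ from the printed terms.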