代码优化技巧:
我正在使用以下 ASM 例程对数组进行冒泡排序。我想知道我的代码效率低下:
.386
.model flat, c
option casemap:none
.code
public sample
sample PROC
;[ebp+0Ch]Length
;[ebp+08h]Array
push ebp
mov ebp, esp
push ecx
push edx
push esi
push eax
mov ecx,[ebp+0Ch]
mov esi,[ebp+08h]
_bubbleSort:
push ecx
push esi
cmp ecx,1
je _exitLoop
sub ecx,01h
_miniLoop:
push ecx
mov edx,DWORD PTR [esi+4]
cmp DWORD PTR [esi],edx
ja _swap
jmp _continueLoop
_swap:
lodsd
mov DWORD PTR [esi-4],edx
xchg DWORD PTR [esi],eax
jmp _skipIncrementESI
_continueLoop:
add esi,4
_skipIncrementESI:
pop ecx
loop _miniLoop
_exitLoop:
pop esi
pop ecx
loop _bubbleSort
pop eax
pop esi
pop edx
pop ecx
pop ebp
ret
sample ENDP
END
基本上我有两个循环,就像通常的冒泡排序算法一样。外循环的 ecx 值为 10,内循环的 ecx 值为 [ecx-1]。我已经尝试过该例程,它可以成功编译并运行,但我不确定它是否有效。
I am using the following ASM routine to bubble sort an array. I want to know of the inefficiencies of my code:
.386
.model flat, c
option casemap:none
.code
public sample
sample PROC
;[ebp+0Ch]Length
;[ebp+08h]Array
push ebp
mov ebp, esp
push ecx
push edx
push esi
push eax
mov ecx,[ebp+0Ch]
mov esi,[ebp+08h]
_bubbleSort:
push ecx
push esi
cmp ecx,1
je _exitLoop
sub ecx,01h
_miniLoop:
push ecx
mov edx,DWORD PTR [esi+4]
cmp DWORD PTR [esi],edx
ja _swap
jmp _continueLoop
_swap:
lodsd
mov DWORD PTR [esi-4],edx
xchg DWORD PTR [esi],eax
jmp _skipIncrementESI
_continueLoop:
add esi,4
_skipIncrementESI:
pop ecx
loop _miniLoop
_exitLoop:
pop esi
pop ecx
loop _bubbleSort
pop eax
pop esi
pop edx
pop ecx
pop ebp
ret
sample ENDP
END
Basically I have two loops, as usual for the bubble sort algorithm. The value of ecx for the outer loop is 10, and for the inner loop it is [ecx-1]. I have tried the routine and it compiles and runs successfully, but I am not sure if it is efficient.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以采取多种措施来加快汇编代码的速度:
不要执行诸如
ja label_1 之类的操作; jmp label_2
。只需执行jbe label_2
即可。loop
是一条非常慢的指令。 dec ebx; jnz Loopstart 速度更快使用所有寄存器而不是重复推送/弹出 ecx 和 esi。也使用 ebx 和 edi。
jmp-targets 应该很好地对齐。在两个循环开始之前和
jbe
从 Intel 获取一份有关您的 cpu 的手册(您可以将其下载为 pdf),其中包含以下时间操作码,也许它还有其他提示。
There are several things you can do to speed up your assembly code:
don't do things like
ja label_1 ; jmp label_2
. Just dojbe label_2
instead.loop
is a very slow instruction.dec ebx; jnz loopstart
is much fasteruse all registers instead of repeatedly push/pop ecx and esi. Use
ebx
andedi
too.jmp-targets should be well aligned. Use
align 4
before the two loop-starts and after thejbe
Get yourself a manual for your cpu from Intel (you can download it as pdf), it has the timings for the opcodes, maybe it has other hints too.
几个简单的技巧:
1)尽量减少条件跳转的次数,因为它们非常昂贵。如果可能的话展开。
2) 重新排序指令以最大程度地减少因数据依赖性而导致的停顿:
3) 避免旧的复合指令(
dec
、jnz
对比loop
更快,并且不绑定到ecx
寄存器)编写比优化 C 编译器生成的代码更快的汇编代码是相当困难的,因为您应该考虑很多因素:数据和指令缓存的大小、对齐方式、流水线、指令时序。您可以在此处找到一些有关此内容的优秀文档。我特别推荐第一本书:用 C++ 优化软件
Several simple tips:
1) Try to minimize the number of conditional jumps, because they are very expensive. Unroll if possible.
2) Reorder instructions to minimize stalls because of data depencency:
3) Avoid old composite instructions (
dec
,jnz
pair is faster thanloop
and is not bound toecx
register)It would be quite difficult to write assembly code that is faster than the code generated by optimizing C compiler, because you should consider lots of factors: size of data and instruction caches, alignments, pipeline, instruction timings. You can find some good documentation about this here. I especially recommend the first book: Optimizing software in C++
如果我们不需要该指令的标志,则替换为“add esi,4”:
Substitute for "add esi,4" if we do not need a flag for this instruction: