Legend:
First a short comment about the subroutine we are going to ASM-optimize.
Second we go through the Steps.
About the Subroutine:
Sometimes we use
REDIM PRESERVE
to increase an Array. instead of Redim'ing every time its more efficient to increase in bigger steps.
And keep a separate Counter on "dimmed Elements" and "last used Element".
For this we can use an universal Subroutine, which is excellent to demonstrate some optimization issues.
Here is it:
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
LOCAL e AS DWORD
e=((INT(a/b)+1)*b)
!NOP
FUNCTION=e
END FUNCTION
Looks quite short and efficient. Anyway we can not really be shure how efficient the Subroutine is unless we have taken a Look into
DisASM. Here is it:
40241C DB450C FILD LONG PTR [EBP+0C]
40241F D9E8 FLD1
402421 DB450C FILD LONG PTR [EBP+0C]
402424 DA7D08 FIDIVR LONG PTR [EBP+08]
402427 E808110000 CALL INTSUB
40242C DEC1 FADDP ST(1), ST
40242E DEC9 FMULP ST(1), ST
402430 DF7DA4 FISTP QUAD PTR [EBP-5C]
402433 8B75A4 MOV ESI, DWORD PTR [EBP-5C]
402436 90 NOP
402437 89B578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], ESI
INTSUB:
FLDCW WORD PTR [0040287E]
FRNDINT
FLDCW WORD PTR [00402876]
RET NEAR
What we see is that the compiler uses a lot of floating-point Mnemonics.
I see two reasons:
First reason is maybe that he doesn't know what we exactly want and to be shure to prevent an overflow, thats what he is doing.
Second reason is that this way there would be no problem to do the same using QUAD variables.
Now whats the next step in optimization?
STEP 1: Replace DWORD's where possible with LONG's
In this case it does not help us anything. Thats why we go to
STEP 2: Make it easier for the compiler to understand what you want.
Make it "easier" means "make it les complex".
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS LONG,f AS LONG
f=INT(a/b)
e=(f+1)*b
FUNCTION=e
!NOP
END FUNCTION
'
' Here is what we get:
'
40241E DB450C FILD LONG PTR [EBP+0C]
402421 DA7D08 FIDIVR LONG PTR [EBP+08]
402424 E81B110000 CALL L403544
402429 DF7DA4 FISTP QUAD PTR [EBP-5C]
40242C 8B7DA4 MOV EDI, DWORD PTR [EBP-5C]
40242F 8B450C MOV EAX, DWORD PTR [EBP+0C]
402432 B901000000 MOV ECX, DWORD 00000001
402437 03CF ADD ECX, EDI
402439 91 XCHG EAX, ECX
40243A F7E9 IMUL ECX
40243C 89C6 MOV ESI, EAX
40243E 89B578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], ESI
It doesn't look shorter, but if you take a close look, we have just
saved 4 Floating-Point Operations.
Instead we use fast Integer Operations.
We have on our List:
- The INT() is still Floating-point.
- The (a/b) is still a FIDIVR (Floating-point Division).
What we do here is, we calculate a division to higher precision just to make a INT later.
Lets try something else.
First we need to define this MACRO:
MACRO A_DIV(P1,P2,P3)
! MOV EAX,P2
! CDQ
! DIV P3
! MOV P1,EAX
END MACRO
'
' Then we can use this:
'
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS LONG,f AS LONG
A_DIV(f,a,b)
e=(f+1)*b
FUNCTION=e
END FUNCTION
'
' And we get this:
'
40241E 8B4508 MOV EAX, DWORD PTR [EBP+08]
402421 99 CDQ EDX:EAX
402422 F7750C DIV DWORD PTR [EBP+0C]
402425 8BF8 MOV EDI, EAX
402427 8B450C MOV EAX, DWORD PTR [EBP+0C]
40242A B901000000 MOV ECX, DWORD 00000001
40242F 03CF ADD ECX, EDI
402431 91 XCHG EAX, ECX
402432 F7E9 IMUL ECX
402434 89C6 MOV ESI, EAX
402436 89B578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], ESI
No more F's in the code. Looks Ok now. What comes now is just a bit of cosmetics.
Here are some more optimization Macros. Use them with caution, as they do not have the overflow protection like the original PB-Code.
' P1,P2,P3 sind Variablen oder Registernamen
MACRO A_MUL(P1,P2,P3)
! MOV EAX,P2
! MUL P3
! MOV P1,EAX
END MACRO
'
' Intel Developers manuals say that ADD is faster then INC
' because it does not need to change some CPU flags.
'
MACRO A_INC(P1)
! ADD P1,1
END MACRO
'
'
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS DWORD,f AS DWORD
A_DIV(f,a,b)
A_INC(f)
A_MUL(e,f,b)
FUNCTION=e
END FUNCTION
'
' And here is what we get:
'
4023CF 8B4508 MOV EAX, DWORD PTR [EBP+08]
4023D2 99 CDQ EDX:EAX
4023D3 F7750C DIV DWORD PTR [EBP+0C]
4023D6 8BF8 MOV EDI, EAX
4023D8 83C701 ADD EDI, BYTE 01
4023DB 8BC7 MOV EAX, EDI
4023DD F7650C MUL DWORD PTR [EBP+0C]
4023E0 8BF0 MOV ESI, EAX
4023E2 89B578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], ESI
Which is still good enough for any program. Just to show the final result, we go one step ahead:
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
REGISTER e AS DWORD,f AS DWORD
A_DIV(f,a,b)
A_INC(f)
!MOV EAX,f
!MUL b
!MOV function,eax
END FUNCTION
'
' Will result in:
'
4023CF 8B4508 MOV EAX, DWORD PTR [EBP+08]
4023D2 99 CDQ EDX:EAX
4023D3 F7750C DIV DWORD PTR [EBP+0C]
4023D6 8BF8 MOV EDI, EAX
4023D8 83C701 ADD EDI, BYTE 01
4023DB 8BC7 MOV EAX, EDI
4023DD F7650C MUL DWORD PTR [EBP+0C]
4023E0 898578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], EAX
Which is the final "Step-by-Step" optimized version.
Lets finally make a raw assumption on how much time we saved with this optimization:
-9- FILD LONG PTR [EBP+0C]
-1- FLD1
-9- FILD LONG PTR [EBP+0C]
-22- FIDIVR LONG PTR [EBP+08]
-46- CALL L403534
-4- FADDP ST(1), ST
-4- FMULP ST(1), ST
-7- FISTP QUAD PTR [EBP-5C]
-1- MOV ESI, DWORD PTR [EBP-5C]
-1- MOV DWORD PTR [EBP+FFFFFF78], ESI
'
' compared with:
'
-1- MOV EAX, DWORD PTR [EBP+08]
-1- CDQ EDX:EAX
-41- DIV DWORD PTR [EBP+0C]
-1- MOV EDI, EAX
-1- ADD EDI, BYTE 01
-1- MOV EAX, EDI
-3- MUL DWORD PTR [EBP+0C]
-1- MOV DWORD PTR [EBP+FFFFFF78], EAX
taking a look on the final Table, we can close with an additional rule:
Avoid divisions where possible :-).
> FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
> REGISTER e AS LONG,f AS LONG
> f=INT(a/b)
> e=(f+1)*b
> FUNCTION=e
> !NOP
> END FUNCTION
Instead of the above, you could do this:
FUNCTION A_AY(BYVAL a AS LONG,BYVAL b AS LONG) AS DWORD
FUNCTION = (a\b+1)*b
END FUNCTION
By using the hierarcy of operators and an integer divide, we can reduce the
process to a single line. Further, because it is so straightforward, we could
replace the function with a Macro and get rid of some more overhead.
@Donad:
It seems to be that easy. But you know - not everything with legs is a horse.
In this case we get a snail.
My Speed-Test shows a difference of ten (the ASM-version is ten times faster).
Experience says that the shortest code is often the fastest. But there are exceptions.
Finally a look into DisASM helps you to get them. This is one.
'ASM-Version - Number 1:
4023D0 8B4508 MOV EAX, DWORD PTR [EBP+08]
4023D3 99 CDQ EDX:EAX
4023D4 F7750C DIV DWORD PTR [EBP+0C]
4023D7 8BF8 MOV EDI, EAX
4023D9 83C701 ADD EDI, BYTE 01
4023DC 8BC7 MOV EAX, EDI
4023DE F7650C MUL DWORD PTR [EBP+0C]
4023E1 898578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], EAX
4023E7 8B8578FFFFFF MOV EAX, DWORD PTR [EBP+FFFFFF78]
4023ED 8D65F4 LEA ESP, DWORD PTR [EBP-0C]
' Short Basic-Code with "Integer divide" - Number 2:
402419 DB450C FILD LONG PTR [EBP+0C]
40241C DB450C FILD LONG PTR [EBP+0C]
40241F DB4508 FILD LONG PTR [EBP+08]
402422 E840230000 CALL L404767
402427 DE0520674000 FIADD INTEGER PTR [00406720]
40242D DEC9 FMULP ST(1), ST
40242F E8F4100000 CALL L403528
402434 898578FFFFFF MOV DWORD PTR [EBP+FFFFFF78], EAX
40243A 8B8578FFFFFF MOV EAX, DWORD PTR [EBP+FFFFFF78]
402440 8D65F4 LEA ESP, DWORD PTR [EBP-0C]
404767 D9FC FRNDINT
404769 D9C9 FXCHST, ST(1)
40476B D9FC FRNDINT
40476D DEF9 FDIVP ST(1), ST ' Here is our "Integer Divide" but you see that "F" infront of it :-))
40476F D92DAC2B4000 FLDCW WORD PTR [00402BAC]
404775 D9FC FRNDINT
404777 D92DA62B4000 FLDCW WORD PTR [00402BA6]
40477D C3 RET NEAR