Taking a closer look at the new PB 9 Compiler: #OPTIMIZE

Theo Gottwald · August 16, 2008, 09:47:32 AM

Now that we have it, it's time to take a closer look at the output of the compiler.

This is our test proggy No. 1:

SUB TestfuncA()
REGISTER R01 AS LONG,R02 AS LONG
! NOP
! NOP
! NOP
FOR R01=0 TO 100000
    GOSUB Laba
NEXT R01
EXIT SUB
Laba:
GOSUB Labb
RETURN
Labb:
RETURN
END SUB

Now let's first take a look at what we got with PB 8.04:

Executable Size: 21504 Bytes

4023D0   NOP
4023D1   NOP
4023D2   NOP
4023D3   MOV ESI, DWORD 00000000
4023D9   CALL L4023ED
4023DE   INC ESI
4023E0   CMP ESI, DWORD 000186A0
4023E6   JLE SHORT L4023D9
4023E8   JMP L4023F4
4023ED   CALL L4023F3
4023F2   RET NEAR
4023F3   RET NEAR

No surprise so far. Now let's take a look at the output of PB 9 using
#OPTIMIZE SIZE

#OPTIMIZE SIZE
24576 Bytes

4024DB   NOP
4024DC   NOP
4024DD   NOP
4024DE   MOV ESI, DWORD 00000000
4024E4   CALL L4024F8
4024E9   INC ESI
4024EB   CMP ESI, DWORD 000186A0
4024F1   JLE SHORT L4024E4
4024F3   JMP L4024FF
4024F8   CALL L4024FE
4024FD   RET NEAR
4024FE   RET NEAR

In fact we get exactly the same instructions, just at different addresses.
Note that the executable size is now 24576 bytes.
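To put the two file sizes side by side, here is a small sketch in Python (not PowerBASIC, just for the arithmetic). The sizes are the ones quoted above. The 512-byte step is my assumption, not something the compiler documentation is quoted on here, and `size_report` is a made-up helper name.

```python
# Executable sizes quoted in the post, in bytes.
PB8_SIZE = 21504          # PB 8.04
PB9_SIZE = 24576          # PB 9, #OPTIMIZE SIZE

# Assumption (not from the post): the file grows in 512-byte steps.
ASSUMED_STEP = 512


def size_report(old_size, new_size, step=ASSUMED_STEP):
    """Return the growth in bytes and each size expressed in whole steps."""
    old_steps, old_rest = divmod(old_size, step)
    new_steps, new_rest = divmod(new_size, step)
    return {
        "delta_bytes": new_size - old_size,
        "old_steps": old_steps,
        "new_steps": new_steps,
        "both_exact_multiples": old_rest == 0 and new_rest == 0,
    }


if __name__ == "__main__":
    print(size_report(PB8_SIZE, PB9_SIZE))
```

Both sizes come out as whole multiples of the assumed step, and the difference is 3072 bytes. If the file really does grow in steps like this, a few extra bytes of code would often not change the reported size at all.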

Now we use PB 9 with the new
#OPTIMIZE SPEED
which is the default mode, meaning it is switched on unless you choose otherwise.

What we expect to see is some more NOPs, because the new compiler will try to align loops (etc.) on byte boundaries to increase the speed of execution. That's why PB 9 programs will often be faster than PB 8 programs,
just by recompiling with the new compiler.

#OPTIMIZE SPEED and Default Mode without #OPTIMIZE
24576 Bytes

4024EB   NOP
4024EC   NOP
4024ED   NOP ; these are our three NOPs
4024EE   MOV ESI, DWORD 00000000
4024F4   NOP ; these are the 12 NOPs inserted by #OPTIMIZE SPEED
4024F5   NOP
4024F6   NOP
4024F7   NOP
4024F8   NOP
4024F9   NOP
4024FA   NOP
4024FB   NOP
4024FC   NOP
4024FD   NOP
4024FE   NOP
4024FF   NOP
402500   CALL L402514
402505   INC ESI
402507   CMP ESI, DWORD 000186A0
40250D   JLE SHORT L402500
40250F   JMP L40251B
402514   CALL L40251A
402519   RET NEAR
40251A   RET NEAR

We find that the compiler inserts NOPs (12 in this listing, up to 15 in general) to align the start of our loop exactly on a 16-byte boundary for best execution speed. What I also noticed is that the total file size did not change in this case.
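The padding count can be checked with a little arithmetic. This is a sketch in Python, using the addresses from the listing above. `nops_needed` is a hypothetical helper, and I am assuming the compiler simply pads with 1-byte NOPs up to the next 16-byte boundary, which is what the listing shows but not something I can confirm about its internals.

```python
def nops_needed(address, boundary=16):
    """Number of 1-byte NOPs needed to move `address` up to the next boundary."""
    return (-address) % boundary


# Address right after "MOV ESI, DWORD 00000000" in the #OPTIMIZE SPEED listing.
AFTER_MOV = 0x4024F4

if __name__ == "__main__":
    pad = nops_needed(AFTER_MOV)
    # Number of padding bytes and the resulting loop start address.
    print(pad, hex(AFTER_MOV + pad))
```

For 0x4024F4 this gives 12 padding bytes and a loop start of 0x402500, matching the disassembly. An address that is one byte past a boundary is the worst case and needs 15 NOPs.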

Where are the timings?

Normally in a posting like this you would expect to see timings and examples.
In this case I have left them out.

The reason is that in small, constructed loops like the ones we use for testing,
#OPTIMIZE SPEED can even be at a disadvantage compared with the unaligned loop.

The result of the automatic alignment depends heavily on:
a) the overall CPU architecture
b) the cache size of the CPU
c) the size of the loop

With very small loops I saw the effect that the code may even run slower when alignment is used inside the loop.

SUB TestfuncA()
'REGISTER R01 AS dword,R02 AS dword
'#register NONE
REGISTER R01 AS LONG,R02 AS LONG
LOCAL D01,D02 AS DOUBLE
! NOP
'#ALIGN 32
FOR R01=0 TO 1000
   FOR R02=0 TO 1000
    GOSUB Laba
! NOP
! NOP
! NOP
   NEXT R02
NEXT R01
EXIT SUB
Laba:
GOSUB Labb
RETURN
Labb:
D01=SIN(D02)
! NOP
! NOP
! NOP
RETURN
END SUB

You can test it on your own CPU with this constructed example. It does not, however, play a role in normal programs, as their loops are mostly much bigger and we do not get such cache effects.

Or let me say it like this:
If you write highly optimized programs with very small loops, you may get a speed advantage by doing the alignment manually with #ALIGN and switching off the automatic alignment with #OPTIMIZE SIZE.

If you have a normal, complicated program which is not hand-optimized, you can just forget about it and the compiler will do the alignment for you.
