This shows how easily it translates.
FreeBasic
' POSTFIX or REVERSE-POLISH NOTATION AND x86 ASSEMBLER
' Charles E V Pegge
' 04 Aug 2007
' FREEBASIC
' FLOATING POINT EXAMPLE
dim as double ans, a=1, b=2, c=3, d=4
' ans=(a+b)*(c+d) ' conventional notation
'
' a b + c d + * to ans ' postfix notation
'======================'
asm
fld qword ptr [a] ' push A onto FPU stack
fadd qword ptr [b] ' add B
fld qword ptr [c] ' push C onto FPU stack
fadd qword ptr [d] ' add D
fmulp st(1) ' multiply the 2 results and pop the stack
fstp qword ptr [ans] ' store the result and pop the stack
end asm '
'======================'
print ans
Hi Charles,
nice samples. RPN is very nice.
Here my two tries:
First for PB/WIN 8.03:
#COMPILE EXE
#DIM ALL
FUNCTION PBMAIN () AS LONG
#REGISTER NONE
LOCAL a, b, c, d AS DOUBLE
LOCAL ans AS DOUBLE
a = 1
b = 2
c = 3
d = 4
' ans=(a+b)*(c+d) ' conventional notation
' a b + c d + * to ans ' postfix notation
'======================'
! fld a ; push A onto FPU stack
! fadd b ; add B
! fld c ; push C onto FPU stack
! fadd d ; add D
! fmulp st(1), st(0) ; multiply the 2 results and pop the stack
! fstp ans ; store the result and pop the stack
'======================'
MSGBOX STR$(ans)
END FUNCTION
Second for HP-42S ;D
RCL "A"
RCL+ "B"
RCL "B"
RCL+ "D"
*
STO "ANS"
Thanks,
Petr
Thanks Petr,
Here is the same thing using long integers on the CPU.
The first block shows a very literal translation of RPN into x86 assembler. This is not as futile as it looks because this strategy may be applied to expressions of almost unlimited complexity on just about any CPU that can add, multiply and has a stack. But when there are general purpose registers available then optimising the code can make a big difference as you can see in the second block.
FreeBasic
' LONG INTEGER EXAMPLE
dim as long ans, a=1, b=2, c=3, d=4
' ans=(a+b)*(c+d) ' conventional notation
'
' a b + c d + * to ans ' postfix notation
'
'========================='
' RIGID INTERPRETATION
' no regs used for storing
' only for doing operations
'========================='
asm '
push dword ptr [a] ' load A onto stack
push dword ptr [b] ' load B onto stack
'-------------------------'
pop ecx ' pop operand
pop eax ' pop accummulator
add eax,ecx ' do op
push eax ' leave result on stack
'-------------------------'
push dword ptr [c] ' load C onto stack
push dword ptr [d] ' load D onto stack
'-------------------------'
pop ecx ' pop operand
pop eax ' pop accummulator
add eax,ecx ' do op
push eax ' leave result on stack
'-------------------------'
pop ecx ' pop operand
pop eax ' pop accummulator
mul ecx ' do op
push eax ' leave result on stack
'-------------------------'
pop eax ' pop result
mov [ans],eax ' store it
end asm '
'========================='
print ans
' OPTIMISED INTERPRETATION
'========================'
asm '
mov ecx, [a] ' load A into ecx accummulator
add ecx, [b] ' add B to it
mov eax, [c] ' load C into into eax accummulator
add eax, [d] ' add D to it
mul ecx ' multiply the two accumulators (result in eax)
mov [ans],eax ' save result
end asm '
'========================'
print ans
Hi,
here comes again PB/WIN version
#COMPILE EXE
#DIM ALL
DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD
FUNCTION PBMAIN () AS LONG
#REGISTER NONE
LOCAL t1, t2 AS DWORD
LOCAL i AS LONG
LOCAL a, b, c, d AS LONG
LOCAL ans AS LONG
a = 1
b = 2
c = 3
d = 4
' ans=(a+b)*(c+d) ' conventional notation
'
' a b + c d + * to ans ' postfix notation
'
'========================='
' RIGID INTERPRETATION
' no regs used for storing
' only for doing operations
'========================='
t1 = GetTickCount
FOR i = 1 TO 1000000000
! push a ' load A onto stack
! push b ' load B onto stack
'-------------------------'
! pop ecx ' pop operand
! pop eax ' pop accummulator
! add eax,ecx ' do op
! push eax ' leave result on stack
'-------------------------'
! push c ' load C onto stack
! push d ' load D onto stack
'-------------------------'
! pop ecx ' pop operand
! pop eax ' pop accummulator
! add eax,ecx ' do op
! push eax ' leave result on stack
'-------------------------'
! pop ecx ' pop operand
! pop eax ' pop accummulator
! mul ecx ' do op
! push eax ' leave result on stack
'-------------------------'
! pop eax ' pop result
! mov ans,eax ' store it
NEXT
t2 = GetTickCount
MSGBOX "ASM General:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)
' OPTIMISED INTERPRETATION
'========================'
'
t1 = GetTickCount
FOR i = 1 TO 1000000000
! mov ecx, a ' load A into ecx accummulator
! add ecx, b ' add B to it
! mov eax, c ' load C into into eax accummulator
! add eax, d ' add D to it
! mul ecx ' multiply the two accumulators (result in eax)
! mov ans, eax ' save result
NEXT
t2 = GetTickCount
MSGBOX "ASM Optimized:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)
PB without ASM "suggestions"
'========================'
t1 = GetTickCount
FOR i = 1 TO 1000000000
ans = (a+b)*(c+d)
NEXT
t2 = GetTickCount
MSGBOX "PB Classic:"+STR$((t2-t1)/1000)+"s, result:"+STR$(ans)
END FUNCTION
I presume I cannot use EBX as PB uses it already ?
I have added #REGISTER NONE to not get in troubles too :)
It is interesting to watch the speed difference.
General approach takes ~12.7 seconds for 1 000 000 000 iterations, while optimized just 3.5 seconds, which is quite the same as PB does by optimizing high-level expression.
Thanks,
Petr
Have a Hint on EBX for you. Not really from me, but directly from Bob Zale, copied from the PB Forum:
QuoteActually, the register preservation requirements for EBX/ESI/EDI only apply to assembler code which precedes BASIC statements within a sub/function. PowerBASIC assumes that it can use these 3 registers to store data from one statement to the next. If there are no more statements, it's of no consequence. PowerBASIC must preserve registers between sub/function calls, and does so automatically. Of course, ESP and EBP must always be preserved.
One note... never change a segment register. Win32 in unforgiving.
Source: PB-Forum (http://www.powerbasic.com/support/forums/Forum4/HTML/002398.html)
Thanks for the info!,
Petr
Thanks for taking those measurements Petr. It shows the cost of generalised methods without optimisation. Of course once you start using interpreter techniques, the time penalty is much more extreme.
I am not sure when PB requires the EBX and ESI registers to be left unaltered in between lines, or for referencing global and static variables, arrays etc.
Pushing them only takes 1 clock each as does popping them.
One potential GOTCHA, Theo, is in loading one of these variables into a reserved register which is normally used to index them, causing a GPF when you try to access the next variable.
There is a simple and cheap method to protect against this, costing 3 clocks per variable loaded instead of 1
dim a as global long, b as global long, c as global long
'....
! push a ' push contents of all the variables to be used
! push b '
! push c '
! pop edi ' now pop then into the target registers
! pop esi '
! pop ebx '
'.... '
This is way beyond me, but in looking through the code I can see sort of how it is going... I don't understand how it knows which operand to use? It is never pushed or put into memory as the variable values are? How does it know to add instead of subtract?
I am not sure when PB requires the EBX and ESI registers to be left unaltered in between lines,
I can help you on this Charles. Reading my own post from before, I can tell you this:
PowerBasic uses these registers:
EBX - internal usage
ESI/EDI - for REGISTER variables, if you do not use #REGISTER NONE
Therefore they need to be preserved and restored if you have BASIC CODE and ASM Code intermixed.
As we see in the Disassembly, the EBP-Register is used to adress local datastructures and variables.
Example:
MOV DWORD PTR [EBP+FFFFFF6C], ESI
Would load a local variable with a LONG or DWORD Value.
Thanks Theo, we have to be aware of shifting sands - different compilers.
Kent,
As Push'n Pop is Last-in First-out buffering, so the above snippet stacks the values of the variables a,b,c which we want to use in the registers . EDI ESI and EBX, may be used to index these variables, so we must not overwrite them while obtaining the values of a,b and c.
Once their values are on the stack, we can safely pop them into these registers.
The stack is indexed by means of the stack pointer ESP. which is automatically decremented before PUSHing and incremented after POPping. but this stack pointer register is seldom altered by us directly.
In this example, the values of a, b and c end up in ebx, esi and edi respectively.
Of course, the three registers we have just overwritten cannot be used
to access further BASIC variables unless we have taken steps to preserve them with another shell of pushes and pops around our assembler code.
However, it is safe to assume that LOCAL variables, including function parameters are indexed only by the EBP register, which is set up for us by PB at the beginning of the function, and it is not normally used for anything else unless you are desperate for a spare reg to make your algorithm zing.
Thanks for the explanation Charles and Theo. Amazing how you guys keep all of which compiler does what etc. The ASM files look different from compiler to compiler too.
It's not usually pointed out, but RPN notation works well with ASM because it reduces complex instructions to their simplest form, wirking first with values (or variables), then performing the specified operation. If you took a typical ASM operation, like ADD eax,a, and wrote it this way: eax,a ADD, it would easily show the similarity in structure.
More specifically, the use of RPN favors a stack structure, but if we can limit or
confine our efforts to the general purpose registers, we will benefit because the registers are more quickly accessed. Aside from the fact that some instructions are specific to certain registers, the use of the stack involves RAM memory, which has to be referenced via the stack pointer, which in term has to incremented or decremented, and this makes it quite a bit slower overall.
Another consideration with RPN is that it favors the retention and reuse of
intermediate results. You can store them somewhere temporarily, then reintroduce them when needed, which can save computational time and effort.
One example of this capability is when multiplying a number by ten by using the
shift-and-add capabilities instead. Let me describe the general principle and you can then play with it.
Begin with some integer and call it x. Copy the original x into EAX. Then copy it from EAX into a second register. Then shift EAX left two places, which
effectively multiplies your x value by four. Add the value of x from the temporary register back into EAX, so that the total is now equivalent to five times x. Then shift EAX left one more place, and your 5x will become 10x. This
will work for any integer value of x, as long as the value when multipled does not result in a register overflow. You can rearrange the shifts and adds slightly and get the same result. For instance, start with a left shift of 1 bit, save the
temporary result, shift left two more places to the left, and add in the temporary result again. There you are.
Unfortunately, there is no clever method available for dividing by ten, which
would be extremely useful for convirting base 2 numbers to base 10 (decimal).
division
Yes the only way to divide a number (unless its a simple power of 2) is by
successive approximation/ This makes both CPU and FPU work very hard, taking more than 40 clocks, about 4 times longer than multiplying.
So whenever possible replace dividing factors with multiplying factors eg:
a*001 instead of a/1000.
multiplying EAX by ten
! shl eax,1
! mov ecx,eax
! shl eax,2
! add eax,ecx
This takes 4 clocks on the CPU instead of 10 using MUL, however floating point multiplications on the FPU are about 3 clocks!
RPN notation
A SHL DUP SHL SHL +
pretending that the EAX and ECX registers are part of a stack.
PS: the new Intel chips do things on half clocks so take my figures as being relative.
While the FPU is highly optimized, there is still a time penalty for conversion of numbers to floating point form and back. Since the PB compilers are optimized for
use with Longs (and will generally revert to floating point form for other numeric
types), your best bet for fastest code is to attempt to use Longs wherever possible.