Notes on unrolling assembler algorithms.

Started by Steve Hutchesson, February 06, 2010, 02:27:49 PM


Steve Hutchesson

In the simple sense, a programmer unrolls an algorithm to try to reduce the loop overhead paid on each iteration of the loop.

With simple code like this,


  label:
    mov eax, [esi+ecx]    ; read value into EAX
    sub ecx, 1            ; decrement the counter
    jnz label             ; loop back until ECX reaches zero


What the processor effectively sees is a scheduled sequence that looks much more like this.


    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label            ; jump taken
    etc .....


When you unroll an algorithm, in the simplest sense you do something like this.


  label:
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next              ; invert the jump test

    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label
  next:


What the processor now sees has fewer taken jumps and therefore less overhead for the same number of operations.


    mov eax, [esi+ecx]
    sub ecx, 1
    jz next               ; jump NOT taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label             ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next               ; jump NOT taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label             ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next               ; jump NOT taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label             ; jump taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next               ; jump NOT taken
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label             ; jump taken


The simple example above is an unroll by 2, but it is often profitable to unroll by larger amounts; 4, 8 and even 16 are not uncommon if they yield measurable results. Intel recommend unrolling small code loops up to the limit of what the processor's trace cache will hold, and as the byte size of this style of code is trivial, if it yields a speed improvement then it's worth coding unrolled loops.
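
As a rough sketch of what a larger unroll can look like, here is the same decrement loop unrolled by 4, simply extending the inverted-jump pattern shown above (assuming ECX starts at 1 or more).

  label:
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next               ; exit early if the count runs out mid-block
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next
    mov eax, [esi+ecx]
    sub ecx, 1
    jz next
    mov eax, [esi+ecx]
    sub ecx, 1
    jnz label             ; jump taken only once per block of four
  next: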

Now there is a factor that imposes an absolute limit on the execution speed of code, and that is how fast you can read and write memory. If an algorithm achieves the maximum read and write speed of memory, nothing you do to it will make it faster, no matter how many other sophisticated optimisations you apply to the rest of the code.

This tends to be the major limiting factor when unrolling loops, so even if there is a theoretical gain from unrolling, once the memory reads or writes are at saturation level the code will never go any faster. What works in your favour is that if you get an algorithm close to the memory read/write limit, it is probably going a lot faster than it did in the first place.

Now there are many other ways to unroll loop code to reduce overhead. If you are performing a block copy using the normal incremented-pointer technique, code like the following can be made faster, up to a limit.


  label:                  ; assuming ECX is a negative value.
    mov eax, [esi+ecx]    ; read value into EAX
    mov [edi+ecx], eax    ; write EAX back to destination address
    add ecx, 4            ; add 4 to ECX for the next read and write
    jnz label             ; jump back if ECX is still less than zero


You can reduce the loop overhead by putting more instructions in the loop.


  label:
    mov eax, [esi+ecx]    ; read value into EAX
    mov ebx, [esi+ecx+4]  ; read next value into EBX
    mov [edi+ecx], eax    ; write EAX back to destination address
    mov [edi+ecx+4], ebx  ; write EBX back to destination address
    add ecx, 8
    jnz label


This copies 8 bytes at a time instead of 4. It is only really suitable for block copy operations, but it cuts the loop overhead in half, so it is often faster. The same limit applies though: once you hit the speed limit of memory reads and writes, it simply will not go faster.
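
To push the same idea a little further, here is a rough sketch of the copy loop unrolled to 16 bytes per iteration, still using only EAX and EBX. It assumes, as above, that ECX holds the negative byte count, that ESI and EDI point past the end of the source and destination, and that the byte count is a multiple of 16 (otherwise the remainder has to be copied separately).

  label:
    mov eax, [esi+ecx]     ; read two values
    mov ebx, [esi+ecx+4]
    mov [edi+ecx], eax     ; write them to the destination
    mov [edi+ecx+4], ebx
    mov eax, [esi+ecx+8]   ; read the next two values
    mov ebx, [esi+ecx+12]
    mov [edi+ecx+8], eax   ; write them to the destination
    mov [edi+ecx+12], ebx
    add ecx, 16            ; 16 bytes copied per iteration
    jnz label              ; one taken jump per 16 bytes instead of per 4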

The outline of the unrolling technique is this: the more operations you can put in a loop, the less overhead you pay per iteration, as long as you understand there are limits. The size of the trace cache is one, and the absolute limit of memory speed is the other.

Theo Gottwald

Unrolling. Most often the situation is not so easy: most often we call subroutines from within the loop.

Then I find myself with a loop like this (schematic in Basic):

FOR I = 0 TO 100
  GOSUB DoTheJob1
  GOSUB DoTheJob2
NEXT

This means the loop itself is very simple, but inside it we call a bigger subprogram, which may not even fit into the cache at all.
I guess in such situations unrolling doesn't make sense.
What would you say, Steve?

I think on newer Intel CPUs the processor may still, even in this case, profit from its internal loop caches.


Steve Hutchesson

Theo,

With nested looping in Basic, the thing to avoid if loop speed matters is making more function calls than you have to, as Basic has a reasonably high call overhead. Anyone who suffered old Microsoft Basic, and even the later VB dialects, tends to overcode to get around the error system built into them, and this leads to more function calls and more code than is needed, which in turn slows the code down even further.

Essentially, instead of making many small function calls, if you can inline more of the code without it becoming too unwieldy to manage, you reduce the loop overhead. In places where it matters, check your sequence of Basic functions to make sure you don't have redundant or unnecessary code, as each extra function call slows your code down.

Theo Gottwald

Quote: "as Basic has a reasonably high call overhead."

Hutch, not in PowerBASIC when using GOSUB; it will just compile to a simple "CALL".
The question is, does unrolling make any sense if I have to call some subroutines?

Steve Hutchesson

Theo,

The "call" or "gosub" followed by a "ret" or "return" in most Basic dialects is in fact very efficient in terms of overhead, as long as you don't mind keeping the code the GOSUB jumps to inside the SUB or FUNCTION. The other factor with call overhead is that if the called code is small enough for the overhead to matter, it is probably small enough to inline directly instead of calling it, so you have no call overhead at all.
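
As a rough sketch of the difference, the loops below compare calling a tiny routine on every iteration with pasting its body straight into the loop; DoTheJob and its one-instruction body are invented here purely for illustration.

  label:                  ; calling version: pays call/ret overhead every iteration
    call DoTheJob         ; hypothetical small routine
    sub ecx, 1
    jnz label

  label2:                 ; inlined version: the routine's body is pasted into the loop
    mov eax, [esi+ecx]    ; assumed body of DoTheJob
    sub ecx, 1
    jnz label2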

Now with unrolling an algorithm, you have to weigh any loop overhead you are trying to reduce against whatever the trace cache size of the particular processor happens to be, and if the looped code is larger than the trace cache it gets slower per iteration.