This function is used to generate a unique code from a word. Part of the hash can then be used to obtain the base indexes of linked-lists for rapid retrieval of data. I use this algorithm in Asmosphere to lookup the opcodes for the instruction set keywords.
The base index is formed from the lower 9 bits of the hash code, giving 512 base locations for a linked list. The full 32 bit hash code is used to get a precice match - traversing the linked list.
Very few steps are required to locate the data. In the case of 300 keywords, most are located within two steps.
PowerBasic
#COMPILE EXE
#DIM ALL
FUNCTION hashcode(s AS STRING) AS LONG
DIM le AS LONG
DIM p AS LONG
le=LEN(s):p=STRPTR(s)
! xor eax,eax
! mov ecx, le ' length
! mov edx, p ' name pointer
! mov ah, cl ' length
! mov al, [edx] ' 1st letter
! shl eax, 8
! mov al, [edx+1] ' 2nd letter
! shl eax,8
nextchar:
! rol eax,1
! xor al, [edx]
! inc edx
! dec ecx
! jg nextchar
! mov function,eax
END FUNCTION
FUNCTION PBMAIN () AS LONG
MSGBOX "movups "+HEX$(hashcode("movups"))
END FUNCTION
FreeBasic
function hashcode(s as string) as long
dim as long le
dim as any ptr p
le=len(s):p=strptr(s)
asm
xor eax,eax
mov ecx, [le]
mov edx, [p] ' name pointer
mov ah, cl ' length
mov al, [edx] ' 1st letter
shl eax, 8
mov al, [edx+1] ' 2nd letter
shl eax,8
nextchar:
rol eax,1
xor al, [edx]
inc edx
dec ecx
jg nextchar
mov [function],eax
end asm
end function
Charles,
I'm very interested in this function.
As you know I'm using something very similar in thinBasic to maintain and use different Hash Tables for Keywords, variables, ...
Your function is very efficient compared to mine. One problem of mine version is that I need that the hash key is a number between 1 and MaxRange where MaxRange can vary depending on the hash table (every hash table has its own max number of buckets where every bucket is a pointer to a linked list).
The following is my function compare to yours. Do you have any idea on how to improve perfomances (of mine of course ;) yours is already very fast ).
Thanks a lot
Eros
PB Code:
#COMPILE EXE
#Dim All
Function hashcode(s As String) As Long
Dim le As Long
Dim p As Long
le=Len(s):p=StrPtr(s)
! mov ecx, le ' length
! mov edx, p ' name pointer
! mov ah, cl ' length
! mov al, [edx] ' 1st letter
! shl eax, 8
! mov al, [edx+1] ' 2nd letter
! shl eax,8
nextchar:
! rol eax,1
! Xor al, [edx]
! inc edx
! dec ecx
! jg nextchar
! mov Function,eax
End Function
Function HashTable_Hash(ByVal ptrKeyBuffer As Byte Ptr, ByVal MaxRange As Long) As Dword
Static gdwN As Dword
gdwN = 0???
Do While @ptrKeyBuffer
gdwN = 31??? * gdwN + @ptrKeyBuffer
Incr ptrKeyBuffer
Loop
Function = gdwN Mod MaxRange + 1&
End Function
Function PBMain () As Long
Dim Counter As Long
Dim MaxLoops As Long
Dim T1 As Double
Dim T2 As Double
Dim sKey As String
Dim pKey As Byte Ptr
sKey = "movups"
pKey = StrPtr(sKey)
MaxLoops = 1000000
T1 = Timer
For counter = 1 To MaxLoops
hashcode(sKey)
Next
T2 = Timer
MsgBox "Time to perform " & Str$(MaxLoops) & " loops: " & Format$(T2 - T1, "#0.000")
T1 = Timer
For counter = 1 To MaxLoops
HashTable_Hash(pKey, 8000)
Next
T2 = Timer
MsgBox "Time to perform " & Str$(MaxLoops) & " loops: " & Format$(T2 - T1, "#0.000")
End Function
Hi Eros,
Ideally the max number of buckets should be a power of 2 then you simply grab the required number of bits off the lower end of the hash. eg hash and 63 (6 bits) to use as the bucket index.
bucket=hash and 63
which is the same as:
! mov eax,hash
! and eax,63
! mov bucket,eax
Taking a modulus for the address is best avoided because division is expensive (40 clocks), but supposing you needed an index for 96 buckets then you could take the first 6 bits, and add the next 5 bits. This sounds cumbersome but it is about 6 times faster than a modulus.
! mov edx,hash
! mov eax,edx
! and eax,63
! shr edx,6
! and edx,31
! add eax,edx
! mov bucket,eax
bucket now contains a value in the range 0..95.
Im am off-PC this afternoon but I hope this gives you a few hints.
Thanks a lot Charles for the help.
Having the max number of buckets a power of 2 is not a problem at all.
I normally use Hash Tables with a pre-loaded 4000 to 10000 buckets in order to sparse enough keys hash values and avoid collisions (eventually managed by linked lists)
Thanks
Eros
I found some publications stating that multiple of 2 hash tables very often result into using the same bucks over and over creating a bad hash distribution.
The best situation seems using a prime number as table size.
In my search almost every hash functions generating a hash value between <1> (or zero) to <HashTableSize> use MOD.
Still searching ...
I think that depends on the hash algorithm being used. If the low order bits, used for the bucket index are thoroughly jumbled, the distribution should be okay. I tried a number of algorithms before arriving at this one. However with a prime number modulus system you may be less sensitive to the algorithm itself.
It is quite easy to monitor the performance of the hash table by tallying the maximum link count.
I've just done a performance check on the current Asmosphere:
There are 512 buckets and 262 ops to go in them.
only 63 of the ops required links and the maximum number of links for any bucket was 2.
When I doubled the bucket count to 1024 then only 36 ops required links with a max of 2
Applying a modulus to this hashing algorithm makes it perform very badly - it is optimised for taking the lower bits.
Best hash so far
for min 512 buckets:
262 words of which 44 require 1 link
FUNCTION hashcode(s AS STRING) AS LONG
DIM le AS LONG
DIM p AS BYTE PTR
le=LEN(s):p=STRPTR(s)
! xor eax,eax
! mov ecx, le
! mov edx, p ' name pointer
! mov ah, cl ' length
! mov al, [edx] ' 1st letter
! cmp cl,1
! jz xhash
! shl eax,3
! xor al, [edx+1] ' 2nd letter
! shl eax,3
nexh:
! rol eax,1
! xor al, [edx]
! inc edx
! dec ecx
! jg nexh
xhash:
! mov function,eax
END FUNCTION
Charles,
how do you use the hash value returned by the function in order to be an index of?
Thanks
Eros
i=hashcode(wd)
v=i and 511
where wd is the keyword i is the hash code and v is the bucket index
This is the instruction lookup function
FreeBasic
function lookupc(wd as string) as long
dim as long v,i
i=hashcode(wd)
v=i and 511
do
if opd(v,0)=i then ' match#
if wd=opn(v) then exit do ' confirm perfect match
end if
v=opd(v,1)
if v=-1 then ert=9:ers="Unidentified instruction: "+wd:exit do ' not found
loop
function=v
end function
Ok, I got it, thanks.
You use overflow of keys (same hash values) on the same array where position 0 is the first bucket and position 1 is a link to the next bucket (and so on).
Mine is a much more complex structure with dynamic allocation of buckets and linked lists.
Even if quite good in speed, it is quite complex. Maybe I need to review and simplify it for better maintainability.
Thanks again.
Eros
Yes Eros, this is a very simple scheme and I get away with it because it is a relatively small fixed set of keywords for the instruction set. But for storing macros & variables with - scoping, allocation and deallocation, it is a bit more complicated. I am still working out the details for Asmosphere which currently has a very simple backwards searching scheme - not even an alphabetic index. This is fine for assembling small scripts. The Opengl demo is the largest so far at 38K and the compile time is barely noticeable. But when dealing with larger programs with many system calls an efficient hashing scheme will be necessary.
This is what I am planning - using a three part array instead of two. The function returns an index to the global macro list (or -1).
It does not matter if the hashes are not unique - the keyword match is always confirmed.
This snippet of code, now operational, shows the hash coder, the lookup function and the hashtable/linked list builder. These are retro-fitted to accelerate a number of search loops in the assembler.
type hashref
h as long ' hash code
l as long ' linked list index
m as long ' macro list index
end type
dim shared mach(4095) as hashref
dim shared as long mle=512 ' macro link list end boundary
function hashcode(s as string) as long
dim as long le
dim as any ptr p
le=len(s):p=strptr(s)
asm
xor eax,eax
mov ecx, [le]
mov edx, [p] ' name pointer
mov ah, cl ' length
mov al, [edx] ' 1st letter
cmp cl,1
jz xhash
shl eax,3
xor al, [edx+1] ' 2nd letter
shl eax,3
nexh:
rol eax,1
xor al, [edx]
inc edx
dec ecx
jnz nexh
xhash:
mov [function],eax
end asm
end function
function lookupm(wd as string) as long
dim as long f,v,i,k
i=hashcode(wd)
v=i and 511
f=-1 ' find last match in chain
do
if mach(v).h=i then ' match#
k=mach(v).m
if k>=me then mach(v).h=0:mach(v).l=0 ' out of scope so break the chain
if (k<me)and(wd=macs(k,0)) then f=k ' match candidate
end if
v=mach(v).l
if v<512 then exit do ' no more in linked list
loop
function=f
end function
sub addm(byval m as long)
dim as long i,j,k,t
i=hashcode(macs(m,0)): if i=0 then ers="zero hash failure: "+macs(m,0):ert=13:exit sub
j=i and 511
do
t+=1
if (mach(j).l<=0)or(mach(j).l>=mle)or(mach(j).m>=me) then
if mle>=4095 then ers="Too many entities to include "+macs(m,0):ert=13:exit do
if (j>=512)or(mach(j).h<>0) then mach(j).l=mle:j=mle:mle+=1
mach(j).h=i:mach(j).l=0:mach(j).m=m: exit do ' added to chain
end if
j=mach(j).l ' next in chain
loop
if t>lmax then lmax=t
end sub