Parsing HTML and Tables of Tables

Charles Pegge · July 02, 2007, 05:47:44 PM

This is a result of a dialogue with Mike Trader.

It demonstrates how to use the ParserWords code in a complex situation. HTML is particularly messy, with its tag attributes, commented areas and embedded scripts with varying syntax rules.

The treatment of quoted text, for instance, changes as you go into a tag and traverse the tag attributes. Outside the tag, a quote mark is just an ordinary character, so the lexical rules have to be changed on the fly.
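To make the point about switching rules concrete, here is a minimal sketch in Python (not PowerBASIC, and not the ParserWords code itself). The function names `tokenize` and `scan_tag` are made up for this illustration. It only shows the idea: quote marks are plain characters in text, but delimit attribute values once the scanner is inside a tag. Comments and embedded scripts are not handled.

```python
def scan_tag(html, i):
    """Scan the inside of a tag, starting just after '<'.

    Returns (index after the closing '>', list of parts).
    Hypothetical helper for illustration only.
    """
    n = len(html)
    parts, word = [], ""
    while i < n and html[i] != ">":
        ch = html[i]
        if ch in "\"'":
            # Tag mode: a quote opens a value that runs to the matching
            # quote, so it may contain spaces, '>' or the other quote mark.
            close = html.find(ch, i + 1)
            if close == -1:
                close = n          # unterminated quote: take the rest
            if word:
                parts.append(word)
                word = ""
            parts.append(html[i + 1:close])
            i = close + 1
        elif ch.isspace() or ch == "=":
            if word:
                parts.append(word)
                word = ""
            if ch == "=":
                parts.append("=")
            i += 1
        else:
            word += ch
            i += 1
    if word:
        parts.append(word)
    return i + 1, parts


def tokenize(html):
    """Split html into ('text', string) and ('tag', parts) items."""
    tokens = []
    i, n = 0, len(html)
    while i < n:
        if html[i] != "<":
            # Text mode: quote marks are ordinary characters here.
            j = html.find("<", i)
            if j == -1:
                j = n
            tokens.append(("text", html[i:j]))
            i = j
        else:
            i, parts = scan_tag(html, i + 1)
            tokens.append(("tag", parts))
    return tokens
```

Calling `tokenize('He said "hi" <a href="x>y.html">link</a>')` keeps the quoted `"hi"` inside the text item untouched, while the `>` inside the `href` value does not end the tag.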

And nested tables are frequently found in web pages - tables of tables. How are you supposed to deal with them?
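One common way to cope with tables of tables is to keep a stack of the tables currently open, so that a table met inside a cell is attached to that cell. The sketch below is only an illustration of that idea in Python, using the standard library `html.parser` module rather than ParserWords; it is not the method used in the download. The class name `TableTreeBuilder` and the list layout (table = list of rows, row = list of cells, cell = list of strings and nested tables) are assumptions made for the example.

```python
from html.parser import HTMLParser


class TableTreeBuilder(HTMLParser):
    """Collect tables as nested lists (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # top-level tables
        self.stack = []    # tables currently open, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = []
            if self.stack and self.stack[-1] and self.stack[-1][-1]:
                # We are inside a cell of the enclosing table:
                # hang the new table on that cell.
                self.stack[-1][-1][-1].append(table)
            else:
                # No enclosing cell, so treat it as a top-level table.
                self.tables.append(table)
            self.stack.append(table)
        elif tag == "tr" and self.stack:
            self.stack[-1].append([])            # new row
        elif tag in ("td", "th") and self.stack and self.stack[-1]:
            self.stack[-1][-1].append([])        # new cell in current row

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.stack.pop()                     # back to the outer table

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] and self.stack[-1][-1]:
            self.stack[-1][-1][-1].append(text)  # text into current cell


builder = TableTreeBuilder()
builder.feed(
    "<table><tr><td>A</td>"
    "<td><table><tr><td>B</td></tr></table></td>"
    "</tr></table>"
)
outer = builder.tables[0]
inner = outer[0][1][0]   # first row, second cell, first item
```

Here `outer[0][0]` is the cell holding `"A"`, and `inner` is the nested table holding `"B"`. The sketch ignores malformed markup such as missing `<tr>` tags, which real web pages often contain.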

Well, I did not fully appreciate the complexities involved, so it took a little longer than the anticipated 2 days, but all will be revealed in this download.

Charles Pegge · July 17, 2007, 04:46:23 AM

ParserWords.inc Without Using Inline Assembler

Mike Trader wanted to know what the equivalent code would be just using Basic. I thought that would be an interesting coding exercise, so here it is:

The source code is about 1/3 the length but I cannot guarantee that it will be 3 times easier to understand.

' For use with ParserWords.bas which provides data framework and demo.

' This has the equivalent functionality of ParserWords.inc but does not use inline assembler.

'
' Charles E V Pegge
' charles@pegge.net
' www.pegge.net
'
' 18 July 2007:


FUNCTION ReadInWords(BYREF cmz AS STRING) AS LONG
'BASIC equivalent creating a lookup table based on
' 1st 2 letters and a set of linked lists
'----------------------------------'
                                   '
GLOBAL tbl  AS STRING              ' tables string
GLOBAL tblp AS BYTE PTR            ' pointer to tables string

 DIM rc     AS LONG                ' cmz reading position 1
 DIM rd     AS LONG                ' cmz reading position 2
 DIM tkn    AS LONG                ' token
 DIM wd     AS STRING              ' word
 DIM wi     AS LONG                ' index for wd chars
 DIM sig    AS LONG                ' signature
 DIM cha    AS LONG                ' chain pointer
 DIM tb2    AS LONG                ' start of chain address
 DIM nexpos AS LONG                ' next available location
                                   '
'----------------------------------'
                                   '
  nexpos=&h2100+tblp                 ' start of linked lists
  rc=1                             ' start of source token/word list
  DO                               '
   WHILE ASC(cmz,rc)=32:INCR rc:WEND '
                                   ' advance to first non space
   rd=INSTR(rc,cmz," ")            ' get to word boundary
   IF rd<=rc THEN EXIT DO          ' no more word boundaries so exit
   tkn=VAL("&h"+MID$(cmz,rc,rd-rc))' read in token and convert it to a number
   rc=rd+1                         ' advance to start of word
   rd=INSTR(rc,cmz," ")            ' look ahead to the word boundary
   IF rd<=rc THEN EXIT DO          ' if no boundary found then exit
   wd=MID$(cmz,rc,rd-rc)           ' extract the word
   FOR wi=1 TO LEN(wd)             ' convert wd to pseudo uppercase
    MID$(wd,wi)=CHR$(ASC(wd,wi)AND &h5f)
   NEXT                            '
   'wd=ucase$(wd)
   rc=rd                           ' update the pointer ready for next token/word
'----------------------------------'
   sig=256 _                       ' create the word signature index
    +(ASC(wd,1)AND &h1f)*8_        '  from 1st letter
    +(ASC(wd,2)AND &h1f)*256       '  from 2nd letter
   cha=nexpos-tblp                ' make cha the next available location
   tb2=CVL(MID$(tbl,sig+5,4))      ' get the end-of-chain address
                                   '
'----------------------------------'
   IF tb2=0 THEN                   ' if it is not present then
    MID$(tbl,sig+1)=MKL$(nexpos)   ' this is a new chain so patch it into the signature index table
   ELSE                            '
    MID$(tbl,tb2-tblp+1)=MKL$(nexpos) ' patch in link address at end of chain
   END IF                          '
'----------------------------------'
                                   '
   MID$(tbl,cha+1)= CHR$(LEN(wd)) _
    +wd+MKL$(tkn)+MKL$(0)          ' insert word
   nexpos=nexpos+5+LEN(wd)         ' points to location of link address
   MID$(tbl,sig+5)=MKL$(nexpos)    ' patch in pointer to address
   nexpos=nexpos+4                 ' update next available space
   IF nexpos-tblp>=LEN(tbl) THEN EXIT DO ' out of allocated space
  LOOP                             '
'----------------------------------'
FUNCTION=0
END FUNCTION


FUNCTION GetNextTokenPB(BYVAL ps AS BYTE PTR, BYVAL le AS LONG, BYREF Tstart AS LONG, BYREF Tend AS LONG ) AS LONG
'=================================================='
'                                                  '
' Parameters:                                      '
' 1 pointer to the string                          '
' 2 length of string to be parsed                  '
' 3 address of start index of word                 '
' 4 address of word end boundary index             '
'                                                  '
' Results:                                         '
' parameters 3 and 4 are updated                   '
'                                                  '
' Return:                                          '
' tokentype or token for the word/symbol identified'
                                                   '
                                                   '
                                                   '
'=================================================='
 DIM en  AS BYTE PTR                               ' string boundary
 DIM pb  AS BYTE PTR                               ' start of word pointer
 DIM pp  AS BYTE PTR                               ' a char pointer
 DIM pe  AS BYTE PTR                               ' word boundary pointer
 DIM tkn AS LONG                                   ' token or tokentype
 DIM ch AS LONG                                    ' character
 DIM sig AS LONG PTR                               ' signature / index address
 DIM pch AS BYTE PTR                               ' char pointer for use in the word/token chain
 DIM lw AS LONG                                    ' word length down-counter
'--------------------------------------------------'
 en=ps+le                                          ' derive the string boundary
 pb=ps+Tend-1                                      ' init the word start pointer
 pe=pb                                             ' init the word end pointer
 pp=pb-1                                           ' set initial char pointer predecremented
'=================================================='
 DO                                                ' loop to skip leading space tokens
  ' for start of next word or token                '
  INCR pp                                          ' next char address
  IF pp>=en THEN pb=en:pe=en:GOTO xit              ' check string boundary
  tkn = @tblp[ @pp ]                               ' get the char and look up its tokentype
  IF tkn>32 THEN EXIT LOOP                         ' if this is not a space token then exit this loop
 LOOP                                              ' repeat
 pb=pp
                                                   ' mark the start position of the word
'--------------------------------------------------' for quotes
 IF tkn=33 THEN                                    ' is it a quote tokentype?
  ch=@pp                                           ' if so then get the actual character
  DO                                               ' loop thru to the end of the quote
   INCR pp                                         ' next char position
   IF pp>=en THEN pe=en: GOTO xit                  ' check against string boundary
   IF @pp=ch THEN EXIT LOOP                        ' check character against quote char. Exit if they match
  LOOP                                             ' repeat
  pe=pp+1: GOTO xit                                ' set the end boundary: and finish
 END IF                                            '
'--------------------------------------------------' for self delimiting tokens
                                                   '
 IF tkn<47 THEN pe=pp+1: GOTO xit                  ' set the word boundary and finish
                                                   '
'==================================================' for words and numbers
 DO                                                '
  INCR pp                                          '
  IF pp>=en THEN pe=en:EXIT LOOP                   ' end of string: the word runs up to the boundary
  ch=@tblp [ @pp ]                                 ' read the tokentype
  IF ch<48 THEN pe=pp:EXIT LOOP                    ' end of word
 LOOP                                              '
 '-------------------------------------------------'
 IF tkn<65 THEN GOTO xit                           ' the word was a number so finish here
                                                   '
                                                   ' identify word
 sig=256+(@pb[0] AND &h1f)*8+(@pb[1]AND &h1f)*256+tblp ' work out the word's signature to use as a table index
 IF @sig=0 THEN GOTO xit                           ' if this is zero then there is no word chain for this signature so finish here
 pch=@sig                                          ' get the address of the word chain
'--------------------------------------------------'
 DO                                                ' loop to check thru each word in the chain until there is a match
  lw=@pch                                          ' get the length of the first word
  pp=pb                                            ' point to the word in the source text
  DECR pp                                          ' predecrement pp
'--------------------------------------------------'
  DO                                               ' loop for character matching.
   DECR lw:IF lw<0 THEN EXIT DO                    ' check length of word by downcounting lw. If this value is negative then there is a successful match
   INCR pch:INCR pp                                ' next chars in word chain
   IF pp>=en THEN EXIT DO                          ' reached end of string without completing the match
   ch=(@pp AND &h5f)                               ' convert to pseudo uppercase
   IF ch<>@pch THEN                                ' if the characters don't match
    IF @pch<>95 THEN EXIT DO                       ' is it an underscore? if not then there is no possible match so exit this comparison
    IF ch=13 THEN ch=0                             ' dash 45 - 32
    IF ch=95 THEN ch=0                             ' underscore
    IF (ch<>0) THEN EXIT DO                        ' if the char to be matched is not one of these then there is no possible match
'--------------------------------------------------'
    DO                                             ' skipping over multiple spacing
     IF pp+1>=en THEN GOTO xit                     ' looking ahead, end of string before the match is completed
     IF @pp[1]>32 THEN EXIT DO                     ' look ahead: is the char a space? then continue till non space char
     INCR pp                                       ' next char address
    LOOP                                           ' continue space skipping
'--------------------------------------------------'
   END IF                                          '
  LOOP                                             ' repeat char matching loop
'--------------------------------------------------'
  IF lw<0 THEN                                     ' match was successful ?
   tkn=PEEK(LONG, pch+1): pe=pp+1:GOTO xit         ' so overwrite tokentype with token, update word boundary and finish
  END IF                                           '
                                                   ' otherwise try the next in the chain
  pch=pch+lw+5                                     ' skip remainder of word
                                                   ' skip the token
  pch=PEEK(LONG,pch)                               ' get the link address for the next word in the chain
  IF pch=0 THEN GOTO xit                           ' there are no more words in this chain so no success
  pp=pb                                            ' reset to start of word
 LOOP                                              ' continue with next word in the chain
'--------------------------------------------------'
xit:                                               ' finalise results
 Tstart=pb-ps+1                                    ' calc the tokenstart
 Tend=pe-ps+1                                      ' calc the token end
 FUNCTION=tkn                                      ' function returns the tokentype or token
END FUNCTION                                       ' quit
'=================================================='
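
For readers who do not use PowerBASIC, the table layout built by ReadInWords can be sketched in Python. This is only an illustration under stated assumptions, not a port. The first 256 bytes are taken to be the character tokentype table, followed by the signature index of 8-byte entries (start of chain, end of chain), with chain records of length byte, word, token and link starting at &h2100. The sketch stores offsets relative to the table where the original stores absolute addresses. The lookup uses an exact match, leaving out the dash, underscore and space handling of GetNextTokenPB. The names `signature`, `pseudo_upper`, `build_table` and `lookup` are made up for the example.

```python
import struct

INDEX_BASE = 256      # signature index follows the 256-byte tokentype table
CHAIN_BASE = 0x2100   # chain records start here, as in ReadInWords


def pseudo_upper(word):
    """Mask each character with &h5f, as the listing does."""
    return "".join(chr(ord(c) & 0x5F) for c in word)


def signature(word):
    """Offset of the index entry chosen by the first two letters.

    Assumes the word has at least two characters.
    """
    return INDEX_BASE + (ord(word[0]) & 0x1F) * 8 + (ord(word[1]) & 0x1F) * 256


def build_table(pairs, size=0x4000):
    """Build the table from (token, word) pairs."""
    tbl = bytearray(size)
    nexpos = CHAIN_BASE
    for token, word in pairs:
        word = pseudo_upper(word)
        sig = signature(word)
        # Record layout: length byte, word, 4-byte token, 4-byte link (0).
        record = bytes([len(word)]) + word.encode("ascii") \
            + struct.pack("<ii", token, 0)
        if nexpos + len(record) > size:
            break                                   # out of allocated space
        tail = struct.unpack_from("<i", tbl, sig + 4)[0]
        if tail == 0:
            struct.pack_into("<i", tbl, sig, nexpos)    # start a new chain
        else:
            struct.pack_into("<i", tbl, tail, nexpos)   # link from last record
        tbl[nexpos:nexpos + len(record)] = record
        # Remember where this record's link field is (the end of the chain).
        struct.pack_into("<i", tbl, sig + 4, nexpos + len(record) - 4)
        nexpos += len(record)
    return tbl


def lookup(tbl, word):
    """Return the token for word, or None (exact match only)."""
    key = pseudo_upper(word)
    target = key.encode("ascii")
    pos = struct.unpack_from("<i", tbl, signature(key))[0]
    while pos:
        n = tbl[pos]
        if tbl[pos + 1:pos + 1 + n] == target:
            return struct.unpack_from("<i", tbl, pos + 1 + n)[0]
        pos = struct.unpack_from("<i", tbl, pos + 5 + n)[0]  # follow the link
    return None


tbl = build_table([(0x101, "table"), (0x102, "td"), (0x103, "tag"), (0x104, "tr")])
```

In this example "table" and "tag" share the same first two letters, so they land in the same chain, and `lookup(tbl, "Tag")` has to step over the "TABLE" record before it finds its token.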
