FreeBASIC CWstr

José Roca · December 27, 2018, 08:31:54 PM

I get correct results if I only include "ustring.inc", but wrong results if I also include "cwstr.inc".


'#CONSOLE ON
#define UNICODE
#INCLUDE ONCE "windows.bi"
'#INCLUDE ONCE "Afx/cwstr.inc"    ' Wrong results if you unrem it
#INCLUDE ONCE "ustring.inc"

  dim w as wstring * 50 = wstr(70000000001)
  dim u as ustring = wstr(70000000001)

  print "Test: CLNGINT"
  print "  u: -" & CLNGINT(u) & "  -  " & u
  print "  w: -" & CLNGINT(w) & "  -  " & w

  if CLNGINT(w) <> CLNGINT(u) then
    print 
    Print "--- ERROR ---"
  else
    print 
    print "---  OK  ---"
  end if


  print 
  print "Test: CULNGINT"
  print "  u: -" & CULNGINT(u) & "  -  " & u
  print "  w: -" & CULNGINT(w) & "  -  " & w

  if CULNGINT(w) <> CULNGINT(u) then
    print 
    Print "--- ERROR ---"
  else
    print 
    print "---  OK  ---"
  end if


PRINT
PRINT "Press any key..."
SLEEP

Juergen Kuehlwein · December 28, 2018, 11:45:20 AM

Ok!

Now (still using the new compiler) when you change the code in CWstr.inc like this:

PRIVATE FUNCTION ValLng OVERLOAD (BYREF cws AS CWSTR) AS LONGINT
RETURN .ValLng(*cast(WSTRING PTR, cws.m_pBuffer))
END FUNCTION

PRIVATE FUNCTION ValULng OVERLOAD (BYREF cws AS CWSTR) AS ULONGINT
RETURN .ValULng(*cast(WSTRING PTR, cws.m_pBuffer))
END FUNCTION

- it delivers correct results.

Please add a debug message to both of these functions and run it with "u" and with "**u", and with "AS LONG(INT)" and "AS ULONG(INT)" respectively. You may change "C(U)LNGINT" to "VAL(U)LNG" in the test code and see, what happens then

The naming of these overloadable functions (VALLNG, VALULNG) is unfortunate, if not misileading...

JK

José Roca · December 28, 2018, 03:52:22 PM

Good catch! I have modified these functions.

Juergen Kuehlwein · December 29, 2018, 10:50:33 AM

José,

yet another thing - when you change the overloaded [] operator to:

Code Select


PRIVATE OPERATOR DWSTR.[] (BYVAL nIndex AS ulong) byref AS USHORT
'***********************************************************************************************
' Returns the corresponding ASCII or Unicode integer representation of the character at the
' zerobased position specified by the nIndex parameter. Can be used to change a value too.
'***********************************************************************************************
static zero as ushort                                 'fallback for nIndex outside valid data

  IF nIndex > (m_BufferLen \ 2) - 1 THEN
    zero = 0
    OPERATOR = zero                                   'return 0
    exit operator
  end if
  
  OPERATOR = *cast(USHORT ptr, m_pBuffer + (nIndex * 2))
END OPERATOR

- it works both ways. That is, you can not only get a character value, but you also can set one! Please check.

JK

José Roca · December 29, 2018, 03:08:10 PM

Yes, it works. It can also be implemented for CBSTR:

Code Select


' ========================================================================================
' Returns the corresponding ASCII or Unicode integer representation of the character at
' the zero-based position specified by the nIndex parameter (0 for the first character,
' 1 for the second, etc.), e.g. value = cws[1], cws[1] = value.
' ========================================================================================
PRIVATE OPERATOR CWStr.[] (BYVAL nIndex AS UINT) BYREF AS USHORT
   STATIC Zero AS USHORT = 0
   IF nIndex < 0 OR nIndex > (m_BufferLen \ 2) - 1 THEN RETURN Zero
   ' Get the numeric character code at position nIndex
   OPERATOR = *CAST(USHORT PTR, m_pBuffer + (nIndex * 2))
END OPERATOR
' ========================================================================================

' ========================================================================================
' Returns the corresponding ASCII or Unicode integer representation of the character at
' the zero-based position specified by the nIndex parameter (0 for the first character,
' 1 for the second, etc.), e.g. value = cbs[1], cbs[1] = value.
' ========================================================================================
PRIVATE OPERATOR CBStr.[] (BYVAL nIndex AS UINT) BYREF AS USHORT
   STATIC Zero AS USHORT = 0
   IF nIndex < 0 OR nIndex > SysStringLen(m_bstr) - 1 THEN RETURN Zero
   ' Get the numeric character code at position nIndex
   OPERATOR = *CAST(USHORT PTR, m_bstr + nIndex)
END OPERATOR
' ========================================================================================

Juergen Kuehlwein · December 29, 2018, 07:20:31 PM

José,

you must set "zero" to zero each time, as i did in my code. Initializing it to zero is not enough.

See, what happens without it, if you run this code:

Code Select


dim u as Ustring = "asdfg"

  u[-1] = 1234

  print  u[-1] 
  sleep

Maybe we should set the result to hFFFF, if an invalid index is given, because hFFFF is an invalid UTF16 character too - 0 isn´t, and there could be 0 at a valid index too. Or am´i wrong with this?

JK

José Roca · December 29, 2018, 08:28:45 PM

What I think is that we should remove error checking and warn that if the index is invalid the result will be undefined, as the FB intrinsic [] does. The main purpose of using [] is speed and we will lose it adding error checking. Those wanting error checking can use the Char property instead.

Jeff Marshall · December 30, 2018, 06:30:00 AM

Quote from: Juergen Kuehlwein on December 20, 2018, 04:12:21 PM
regarding GIT, i (at least intentionally) didn´t change anything else than some files in the compiler folder. The .jkp file is the project file for my IDE i obviously forgot to add to the ignore list. So please turn down all other changes.

I see now it is due to the added .gitattributes and your local machine probably has the default system wide setting of core.autocrlf=false. To work better with the fbc sources on multiple platforms, it is better (for the fbc source tree in particular) to have this set to core.autocrlf=true. Line endings will automatically be converted to CRLF for all files on windows (LF on linux). There's only a couple places in the source tree that we force LF line endings.

Also, fbc source mostly has TAB character for indent. Comments for statements on the line above (not in line or at end of line). And no need for long '' --- * 70 comment breaks.

>> then just modify the AST expr.
>I actually don´t change the expr, i just redirect the code flow, based on the fact that one of our UDTs is recognized (by it´s name for now).
>>Avoid the GOTO's, some of them are skipping over DIM statements.
> But the alternative would be to make major changes in code structure and thus potentially breaking things

You are making it really hard on your self, and me trying to read your code. Have a look at https://github.com/jayrm/fbc/commits/udt-wstring specifically in rtl-string.bas. 50 lines changed instead of 500.

QuoteBetter would be to use #pragma push/pop

> but what for is #pragma push/pop?

Push/Pop, or assignment, allows to turn the new behaviour on and off within the same source code.

>>attach the "UDT-is-a-Wstring" flag to the UDT's type-def[/quote]
> Which would require some new syntax for specifying this flag. Why not just check for "can-cast-as-wstring", which would have the same effect without the need for any syntax change/addition.

I agree. There should be no need to check for special names (it slows down the compiler). Attaching "can-cast-as-wstring" flag to the UDT can help speed up the compiler if the check is made often.

>For a schedule i would propose these steps:
>1.) make sure, that, what we have now, really works FLAWLESSLY for Windows, LINUX and maybe other targets, which can deal with wide strings.

Where are your test files? What source you are using to check that your changes work?

> 2.) integrate this into the next official release, maybe with #2 of your list, maybe without it. This is a timing thing: when will be the next release, and how much time will it need to get #2 tested and implemented?

"fixing UDT implicit casting for built in [w]string rtlib functions", which is what you've been working on? The #pragma tells fbc to prefer casting to WSTRING type if it is a UDT, rather than STRING type which fbc does by default. That's a pretty good solution, especially if you are building UNICODE only programs.

> 3.) solve #2 and #3 of your list.

The problems of #3 will be noticed with respect to strings when a UDT has both cast to wstring and string. However, if we have fbc just prefer one or the other, there's no ambiguity for the built in run time functions. Though the problem still exists.

> 4.) make USTRING a built-in type (#1 of your list). Personally i can live with fact of having to include an extra file for dynamic wide strings, so it is last in my personal priority and my schedule. What do you think?

Yes, later.

----
in https://github.com/jayrm/fbc/commits/udt-wstring
- I (re)added the pragma as push/pop
- made some changes to rtl-string.bas
- I didn't look much in to the other additions you are making, as I have no idea what the cases are that you are fixing.

For example, here is a test I created for the few additions that I made (sorry, macro abuse):

Code Select


#include once "../WinFBX/Afx/CWStr.inc"

#pragma udt_wstring=true

type some_wide_string as CWStr

private function EscapeUnicodeToAscii _
	( _
		byref w as const wstring _
	) as string

	dim ret as string

	for i as integer = 0 to len(w)-1
		select case w[i]
		case 9
			ret &= "\t"
		case 32 to 127
			ret &= chr(w[i])
		case else
			ret &= "\u" & hex(w[i],sizeof(wstring)*2)
		end select
	next

	function = ret

end function

private sub wcheck overload _
	( _	
		byref tag as const string, _
		byref expr as const string, _
		byref a as const wstring, _
		byref b as const wstring _
	)
	
	dim fail as boolean

	print tag & ", " & expr & ": ";

	if( len(a) <> len(b) ) then
		print !"Failed!: length does not match"
		fail = true
	elseif( a <> b ) then
		print !"Failed!: not matched"
		fail = true
	else
		print !"OK"
	end if

	if( fail ) then
		print !"\tA: """ & EscapeUnicodeToAscii( a ) & """"
		print !"\tB: """ & EscapeUnicodeToAscii( b ) & """"
		end 1
	end if
	print

end sub

private sub wcheck overload _
	( _
		byref tag as const string, _
		byref expr as const string, _
		byval a as const integer, _
		byval b as const integer _
	)

	print tag & ", " & expr & ": ";

	if( a = b ) then
		print !"OK"
	else
		print !"Failed"
	end if
	print

end sub

#macro t( expr )
	#define Xexpr(X) expr
	a2 = Xexpr(a1)
	b2 = Xexpr(b1)
	wcheck( "A=B", #expr, a2, b2 )
	wcheck( "f(A)=f(B)", #expr, Xexpr(a2), Xexpr(b2) )
	wcheck( "f(A)=f(**B)", #expr, Xexpr(a2), Xexpr(**b2) )
	#undef Xexpr
#endmacro

#macro t_int( expr )
	#define Xexpr(X) expr
	a_int = Xexpr(a1)
	b_int = Xexpr(b1)
	c_int = Xexpr(c1)
	wcheck( "f(A)=f(B)", #expr, a_int, b_int )
	wcheck( "f(A)=f(**B)", #expr, a_int, c_int )
	#undef Xexpr
#endmacro

private sub do_test _
	( _
		byref text as const wstring _
	)

	dim a1 as wstring * 40 = text
	dim b1 as some_wide_string = text
	dim c1 as wstring * 40 = **b1

	dim a2 as wstring * 40
	dim b2 as some_wide_string

	dim a_int as integer
	dim b_int as integer
	dim c_int as integer


	t( left( X, 3 ) )
	t( right( X, 3 ) )

	t( ltrim( X ) )
	t( rtrim( X ) )
	t( trim( X ) )

	t( ltrim( X, any !" \t" ) )
	t( rtrim( X, any !" \t" ) )
	t( trim( X, any !" \t" ) )

	t( mid( X, 2 ) )
	t( mid( X, 3 ) )
	t( mid( X, 2, 3 ) )
	t( mid( X, 4, 6 ) )

	t( lcase( X ) )
	t( ucase( X ) )

	t_int( asc( X ) )

	t_int( instr( X, !"\u304B" ) )
	t_int( instr( X, !"not-here" ) )
	t_int( instr( X, !"Hel" ) )
	t_int( instr( X, !"\u3055\u3093" ) )

	t_int( instrrev( X, !"\u304B" ) )
	t_int( instrrev( X, !"not-here" ) )
	t_int( instrrev( X, !"Hel" ) )
	t_int( instrrev( X, !"\u3055\u3093" ) )

end sub

do_test( !"Hello \u304A\u304B\u3042\u3055\u3093" )
do_test( !"  Hello \u304A\u304B\u3042\u3055\u3093  " )
do_test( !" \tHello \u304A\u304B\u3042\u3055\u3093\t  " )
do_test( !" \u3042 " )
do_test( !"\u3042" )

I wrote the test this way so I can inspect the results on any dumb ascii terminal, and can convert to fbc's test-suite later. Anything you change or add, you should have a test in mind, that can be automated.

Juergen Kuehlwein · December 30, 2018, 03:38:56 PM

Jeff,

QuoteYou are making it really hard on your self, and me trying to read your code. Have a look at https://github.com/jayrm/fbc/commits/udt-wstring specifically in rtl-string.bas. 50 lines changed instead of 500.

Please have a look at post #145:

QuotePS:
@Jeff,

fixing UDT implicit casting for built in [w]string rtlib functions

I think i have a quite simple solution for this, just have a look at "Parser-Compound-Select.bas", line 119. We could make can-cast-to-(w)string functions out of it and insert tests, in all places i currently check for the UDT´s name. This would be a generic approach then.

you coded, what i basically proposed as a generic approach.

As for the test code: every time i had a new version of "ustring.inc", the code to test (test.bas) and/or the compiler, i added it as "ustring.zip" as an attachment at the bottom of my post. You find the last one in post #147 (at the bottom of it) As a registered member of this forum you can download it. I removed previous ones in previous posts in order not to waste José´s web space by outdated versions. My latest is attached right here

JK

Juergen Kuehlwein · December 30, 2018, 05:03:44 PM

And Jeff,

i don´t want to make it hard on anyone! As i said before:

QuoteA compiler is a very complicated piece of software, and as long as i don´t understand in depth, what happens, and why exactly it happens, i´m extremely careful with changes and try to isolate my new code as much as possible. In other words i know that "goto" is often considered to be bad coding style. But the alternative would be to make major changes in code structure and thus potentially breaking things, i even don´t know of. I took care that skipping DIM statements is safe, in that the skipped variables are not needed anymore in places i´m jumping to.

How long have you been working on the compiler and how long have i been working on it? There is a huge difference in knowledge and experience between us on that matter. So how would you start to change a unknown software consisting of almost 200 separate files? Very carefully, i think.

Don´t let yourself be fooled by the style i added my changes. I did it exactly this way, deliberately and on purpose. It makes debugging easier for me, because i change the original code and code flow as minimal as possible in order not to introduce bugs, which weren´t there before. When using a generic function, a change in one place affects code in many places. If i keep it separate (accepting the overhead of doing the same or at least similar in several places, as i did) i have a chance of testing each change separately, which otherwise could cross influence each other.

This doesn´t mean that i expect it to stay this way. The current version is a test version for me, where i test:
- is it possible at all (yes, i´m convinced, it is possible)
- do the changes work as expected (as far as i can tell - yes)
- do the changes break existing things (as far as i can tell - no)

I don´t insist on long '' --- * 70 comment breaks. I´m not partricularily fond of my initials (JK) even, if i sign my posts with it. I use my initials to mark the code i added, so i just have to search for "jk", which isn´t a very common character sequence in FB. And i use the heavy comment breaks to visually mark my changes. So i can easily decide, what was has already been there, and what is my addition/change. All of this "jk"- specific things should be removed for a release version.

Quoteas I have no idea what the cases are that you are fixing

I fixed every place where a wstring worked and an ustring didn´t without prepending "**". Currently i´m in a position (and this was my first goal), where i can say: it is possible and it works. It needs more testing by other people to verify this. I provided test code, maybe you would like to add some more tests, or maybe you will find yet another syntax element or a bug, which must be fixed.

I have several own projects with ustrings, which compile and work with the new compiler version(s) and i can compile Paul´s WinFBE with the new compiler, even if i remove all prepended "**". (the latest version of the compiler is in my fork, a compiled Windows version of the compiler + ustring.inc + test code is in the attachment a the end of my previous post)

Ustring.inc has been adapted to (hopefully) work for Linux too. Stw currently runs tests with it, because i don´t have Linux.

Let´s assume it works in Linux as well, then we have a proof concept in that the compiler and ustring.inc work together flawlessly. As a next step, we could make it generic and replace code, which is all the same in various places. In fact it isn´t all the same in all places and i would have had a hard time debugging my changes, if i hadn´t kept it separate - maybe you wouldn´t, because you are much more familiar with the compiler than i am.

If you need some references about my coding skills, take a look at my IDE (https://jk-ide.jimdo.com/), which works for PowerBASIC and FreeBASIC. There might be some undetected quirks with FreeBASIC, because it is a recent adaption still, but it is the only IDE i know of for FreeBASIC, with a built-in stepping debugger for 32 and 64 bit apps. There are other built-in debugging features, which might come in handy.

JK

Jeff Marshall · December 30, 2018, 09:59:50 PM

Juergen, knowledge and experience of the compiler or coding skill has nothing to do with it. I'm only looking at the code in front of me, and offering suggestions for next steps, as if someone wanted to merge this in to fbc's code base. I don't how else to offer suggestions. More smilies?

Factoring common logic in to a subroutine, making the code concise, is a good programming practice. Eliminates duplicate code. And a well named subroutine can make the caller's code easier to read for the programmer. Also, well named variables can give the reader an idea of what the variable is used for without having to read through all the logic to figure it out.

The other "style" choices (like tabs vs spaces) are a historical artifact of the compiler's development. Unneccessary changes create meaningless diffs that are a pain to look through when inspecting change sets. I've seen other projects that have very rigid formatting rules, and the code is beautiful to read. We are kind of lazy about it, so if it doesn't get addressed up front, it probably never will.

I got the ustring.inc file, I missed seeing that before, thanks. Yes, you can not underestimate the value of a well written test-suite. Have you looked at any of tests in ./tests to see the format? wstring tests are somewhat lacking. ustring.inc is a decent start. It's time consuming to write good tests that cover the range of use.

Anyway, consider using astMaybeConvertUdtToWstring(), I have a feeling it will simplify code in more places than you think. That way you can debug/change common logic in one place. (For example, check if UDT has both cast wstring/string and do something different if it does, etc, in future.)

Also, careful in astNewCONV. The tricky part of this procedure is that callers expect it to behave a certain way. Changing astNewCONV won't break user code, instead you get bugs down the road where USTRING might silently succeed in places where it shouldn't. Making changes to astNewCONV usually means inspecting every place it is called from.

I want to work on fbc 1.06 release. I was hoping to wrap up varargs feature, but I have a feeling the internal compiler changes are too aggressive to merge in just before a release, and there are a few loose ends on the feature, so I might hold off on that just now to get working on the release. I am guessing 2 months to get a release done. Mostly because I haven't done it in like 10 years, and my tool chains for some targets are broken.

Juergen Kuehlwein · December 30, 2018, 11:18:00 PM

Jeff,

Quote
Anyway, consider using astMaybeConvertUdtToWstring(), I have a feeling it will simplify code in more places than you think. That way you can debug/change common logic in one place. (For example, check if UDT has both cast wstring/string and do something different if it does, etc, in future.)

you are always one step ahead. I´m not reluctant to do it the way you propose, this will be the next step. But i hope you understand now, why i did it the way i did it. I must be sure not to spoil something and i must be able to debug it, until i can be sure that my changes don´t have unwanted side effects. This is because currently i understand about 10% of the code. The other 90% are still "unknown ground" i at best can make guesses at.

Did you have a look at "test.bas", which tries to test the applied changes? Should i adapt it to the format used in \tests?

I will try to comply to the coding rules you stated for FB in order to make merging as easy as possible. Don´t hesitate to correct me in this area and don´t hesitate to correct me, if i´m wrong with my coding.

Regarding astNewCONV: i changed the codeflow for conversions to Single and Double from Ustring and i changed "ldtype" to "FB_DATATYPE_WCHAR" whenever an ustring is passed. As far as i understand it, this doesn´t break anything, it enables ustrings to be processed like wstrings, which seems to work in my test.bas and other places - or did i miss something?

Happy New Year - i´ll probably be offline the next two days

JK

Paul Squires · December 31, 2018, 04:07:06 PM

Quote from: Jeff Marshall on December 30, 2018, 09:59:50 PM
The other "style" choices (like tabs vs spaces) are a historical artifact of the compiler's development. Unneccessary changes create meaningless diffs that are a pain to look through when inspecting change sets. I've seen other projects that have very rigid formatting rules, and the code is beautiful to read. We are kind of lazy about it, so if it doesn't get addressed up front, it probably never will.

Hi Jeff, I love the idea of having a formatting code rulebook for the compiler (or FB source code in general). I need to adopt such a consistent style to my own coding as well as I have flip flopped between styles over the years and have yet to find one that I particularly like. I find the compiler source code very nice because it uses liberal use of whitespace, large indents, and consistent use of spacing between variable names, unary operators, parenthesis, etc.

Would you be cool if I start assembling a list of such formatting items that the compiler source code uses and post them here or in the FB forum for yourself and the other compiler pros to comment on and improve? It could possibly help with future contributors wanting to share source code.

Happy New Year to you! You're doing great work with FB and should be acknowledged for it.

Juergen Kuehlwein · January 01, 2019, 07:22:00 PM

Formatting...

while i´m wiling to accept, what Jeff stated as formatting rules, i prefer a slightly different format, which i have been using for many years.

My formatting rules are:

- Indentation by 2 spaces
- Procedures (or other logic entities) are separated by a separating line (*, =, whatever)
- A comment box after the procedure header explains what the procedure is for and evtl. what goes in and what goes out
- All local variables are defined at the beginning of a procedure, each one getting a separate line
- Comments before code lines only for important things
- Comments at the end of a code line (column aligned) for a regular comment

personally i like to comment very much, so for me this is harder to read and to understand:

Code Select

...
'***********************************************************************************************
' do this or that, return succes = 1, fail = 0
'***********************************************************************************************
function do_it(s as string, p as long) as long
'loop variable
dim i as long    
'character
dim c as byte         


  'scan characters
  for i = 1 to len(s)      
    c = s[i]

    'do this
    if p > c then     
      return 1

    'do that
    elseif p < c then   
      return 1

    'otherwise do the following
    elseif p = c then    
      return 0
        
    end if
  next i
  
  
end function


'***********************************************************************************************
...

than this

Code Select

...
'***********************************************************************************************


function do_it(s as string, p as long) as long
'***********************************************************************************************
' do this or that, return succes = 1, fail = 0
' maybe more lines here, if needed ...
'***********************************************************************************************
dim i as long                                         'loop variable
dim c as byte                                         'character


  for i = 1 to len(s)                                 'scan characters
    c = s[i]

    if p > c then                                     'do this
      return 1

    elseif p < c then                                 'do that
      return 1

    elseif p = c then                                 'otherwise do the following
      return 0
        
    end if
  next i
  
  
end function


'***********************************************************************************************
...

Due to the structure you can easily detect the procedure header, the variables used in this procedure, where it starts, where it ends and what it does in general.

On the left side is the code without distracting interspersed comments, this way it is tighter, you need less lines and thus you can see more code lines on one screen.

On the right side neatly aligned are short comments explaining what the code does. So you have code and comments separated but still linked to each other.

I don´t use tabs, because i always found it irritating, when lines jump horizontally. One space for indentation is not enough for visually structuring nested code, two spaces do the job quite well. More spaces tend to stretch the code to much horizontally for heavily nested code, especially if you want to have comments in the same line. My IDE (ab)uses the tab key to move the caret to a fixed column and to align existing comments to this column, if possible.

I use meaningful names for procedures, but i tend to use short names for local variables like i,n,x,y,z for numeric types and a,b,c,d,s for string types. Whenever such a "meaningless" name is used, i add a comment to the definition line for clarity.

After having written and tested a procedure, i remove debugging code and add all kinds of comments in places, which proved to be difficult to code. e.g why i did it this way and why that way doesn´t work, possible side effects and the like. This seems to be much of an effort, but in the long run, at least for me, it has proved to be of enormous help.

JK

Juergen Kuehlwein · January 01, 2019, 07:39:49 PM

@Paul,

obviously you are following this thread and you could be of great help, because your WinBFE is a large project implementing José´s WinFBX.

Currently i´m able to compile your sources (all "**" removed) and as far as i can tell, it runs as expected. Because it is your project, you must have better means of testing than i have. Testing only ones own code is by no means sufficient for detecting all bugs, as we both definitely know. Would you help me testing the compiler changes with your project ? There soon will be a new version with generic routines as Jeff requested and i would appreciate someone as experienced as you running tests with it.

JK

News:

FreeBASIC CWstr