
An Unicode reflexion

Started by José Roca, April 17, 2011, 09:58:02 PM



Frederick J. Harris

Wow!  Thanks for the enlightenment, José. 

Now I'm going to see what interactions are going on here by adding database access to the equation. 

Jeff Blakeney

I've been staying out of this discussion as I really don't know much about Unicode or the internal workings of Windows and haven't had much chance to play with PB10 yet.  Having the WSTRING and WSTRINGZ types added to PB is, I think, nice to have as an option, and if I ever write something that might have an international audience, then I can see myself making use of those data types.  However, I am Canadian and, even though my country is officially bilingual, I cannot read or write French, so I have no need for extended characters for my own use.  I have never added any foreign language support to my computer or applications because, even if foreign languages get displayed correctly on my machine, I still won't be able to understand them.

José mentioned that there is no down side to using the Unicode strings as opposed to the ANSI strings, but I can think of one: the fact that Unicode strings use twice as much memory as the ANSI versions.  This is a big reason why I personally won't be switching over to using only the Unicode string types.
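
To put a number on that claim, here is a minimal C sketch (C rather than PB, and assuming UTF-16 wide characters as on Windows; the demo data is made up). For plain ASCII text the wide form takes exactly twice the bytes, and every second byte is zero:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Made-up demo data; assumes UTF-16 wide characters as on Windows. */
    const char           a[] = "PowerBASIC";
    const wchar_t        w[] = L"PowerBASIC";
    const unsigned char *p   = (const unsigned char *)w;
    unsigned i;

    printf("ANSI bytes: %u\n", (unsigned)sizeof(a));   /* 11: ten chars + NUL    */
    printf("wide bytes: %u\n", (unsigned)sizeof(w));   /* 22 on Windows: 11 x 2  */

    /* For plain ASCII text, every second byte of the wide form is zero. */
    for (i = 0; i < (unsigned)sizeof(w); i++) printf("%02X ", p[i]);
    printf("\n");
    return 0;
}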

José Roca

 
Quote
José mentioned that there is no down side to using the Unicode strings as opposed to the ANSI strings, but I can think of one: the fact that Unicode strings use twice as much memory as the ANSI versions.  This is a big reason why I personally won't be switching over to using only the Unicode string types.

Sorry, but this is not what I said. The subject of this discussion is the convenience of using the Windows API Unicode functions instead of the Windows API ANSI versions, and I think it has been made clear that, for new applications, we should use the Unicode ones. This is not quite the same as using Unicode or ANSI for other purposes.

I'm not going to remove the ANSI stuff, but I will promote the use of the Unicode API functions by using them in my code examples.

Edwin Knoppert

In C# there is barely any support for ANSI; in normal use you put byte data in a byte array and Unicode text in strings.
In PB we don't tend to use byte arrays for file data, for example, but simply use a string.
This is a habit you'll need to learn to avoid.
Use the variable types as intended: binary data should go in a byte array, but PB still gives you the convenience of using ANSI strings in your code.
That is why Unicode actually gets used very little.
Every part of your app that works with representational (displayed) data should be in Unicode.
That is why MS has enforced Unicode in resources from day one for those parts (like string tables and dialogs).
Don't try to avoid Unicode.

Ever thought MS might abandon ANSI altogether? I could imagine it, since they already provide a virtual XP machine, which gives them the opportunity to make drastic steps.
Anyway, I don't think they'll drop it within two decades :)

(+ they could also abandon the Win API as we know it :) )

Jeff Blakeney

José, in reply #25, you said "There are advantages and no disadvantages."  Yes, you are referring to there being advantages to using the Unicode API calls, but I still stand by my statement that the fact that the strings I pass to the Unicode API use twice as much memory as the ones I'd pass in the same call to the ANSI API is a down side or, using your word, a disadvantage.

I can't control how much memory Windows uses, but I do have some control over how much memory my application uses, and having all my strings use twice as much memory, especially when, in my case, every second byte would be a null, doesn't make much sense.  However, like I said, if I write stuff that will be shared and other language support is needed, then I will create either an international version of the program or a version that can be set to work with either Unicode or ANSI.

José Roca

#35
Quote
José, in reply #25, you said "There are advantages and no disadvantages."  Yes, you are referring to there being advantages to using the Unicode API calls, but I still stand by my statement that the fact that the strings I pass to the Unicode API use twice as much memory as the ones I'd pass in the same call to the ANSI API is a down side or, using your word, a disadvantage.

In fact, calling the ANSI version uses more memory, because what most of the ANSI functions do is convert the string parameters to Unicode and then call the Unicode version. So you are using both the memory that you have allocated and the memory allocated by Windows for the Unicode conversion.

See, for example, what LoadLibraryExA does:


/******************************************************************
 *              LoadLibraryExA          (KERNEL32.@)
 *
 * Load a dll file into the process address space.
 *
 * PARAMS
 *  libname [I] Name of the file to load
 *  hfile   [I] Reserved, must be 0.
 *  flags   [I] Flags for loading the dll
 *
 * RETURNS
 *  Success: A handle to the loaded dll.
 *  Failure: A NULL handle. Use GetLastError() to determine the cause.
 *
 * NOTES
 * The HFILE parameter is not used and marked reserved in the SDK. I can
 * only guess that it should force a file to be mapped, but I rather
 * ignore the parameter because it would be extremely difficult to
 * integrate this with different types of module representations.
 */
HMODULE WINAPI DECLSPEC_HOTPATCH LoadLibraryExA(LPCSTR libname, HANDLE hfile, DWORD flags)
{
    WCHAR *libnameW;

    if (!(libnameW = FILE_name_AtoW( libname, FALSE ))) return 0;
    return LoadLibraryExW( libnameW, hfile, flags );
}

/***********************************************************************
 *           LoadLibraryExW       (KERNEL32.@)
 *
 * Unicode version of LoadLibraryExA.
 */
HMODULE WINAPI DECLSPEC_HOTPATCH LoadLibraryExW(LPCWSTR libnameW, HANDLE hfile, DWORD flags)
{
    UNICODE_STRING      wstr;
    HMODULE             res;

    if (!libnameW)
    {
        SetLastError(ERROR_INVALID_PARAMETER);
        return 0;
    }
    RtlInitUnicodeString( &wstr, libnameW );
    if (wstr.Buffer[wstr.Length/sizeof(WCHAR) - 1] != ' ')
        return load_library( &wstr, flags );

    /* Library name has trailing spaces */
    RtlCreateUnicodeString( &wstr, libnameW );
    while (wstr.Length > sizeof(WCHAR) &&
           wstr.Buffer[wstr.Length/sizeof(WCHAR) - 1] == ' ')
    {
        wstr.Length -= sizeof(WCHAR);
    }
    wstr.Buffer[wstr.Length/sizeof(WCHAR)] = '\0';
    res = load_library( &wstr, flags );
    RtlFreeUnicodeString( &wstr );
    return res;
}


Do not believe that M$ is going to duplicate code to save memory.

See that the ansi version calls FILE_name_AtoW and then LoadLibraryExW.

And this is what FILE_name_AtoW does:


/***********************************************************************
 *           FILE_name_AtoW
 *
 * Convert a file name to Unicode, taking into account the OEM/Ansi API mode.
 *
 * If alloc is FALSE uses the TEB static buffer, so it can only be used when
 * there is no possibility for the function to do that twice, taking into
 * account any called function.
 */
WCHAR *FILE_name_AtoW( LPCSTR name, BOOL alloc )
{
    ANSI_STRING str;
    UNICODE_STRING strW, *pstrW;
    NTSTATUS status;

    RtlInitAnsiString( &str, name );
    pstrW = alloc ? &strW : &NtCurrentTeb()->StaticUnicodeString;
    if (oem_file_apis)
        status = RtlOemStringToUnicodeString( pstrW, &str, alloc );
    else
        status = RtlAnsiStringToUnicodeString( pstrW, &str, alloc );
    if (status == STATUS_SUCCESS) return pstrW->Buffer;

    if (status == STATUS_BUFFER_OVERFLOW)
        SetLastError( ERROR_FILENAME_EXCED_RANGE );
    else
        SetLastError( RtlNtStatusToDosError(status) );
    return NULL;
}


Where are the memory savings?

Frederick J. Harris

I fought the good fight for many years against wide character strings too, Jeff.  And it was a hard fight, and it lasted for many years (since around 2000, I'd say).  Then, a year or two ago, I finally decided to surrender.  It's a fight that can't be won.  Of course my battles were all waged on C and C++ battlegrounds and mostly coding for Windows CE.  But I've surrendered and life is better now.

The main issue for me isn't international support.  Switching to Unicode (I actually prefer the term 'wide character string', but it's longer) has merit simply based on the idea that, since Windows NT in the early 90s, Windows operating systems have internally worked with the two-byte character set.  I hear that in the Linux world four bytes are used for characters.  In any case, I've always suspected what José described above, where the use of ANSI would likely increase memory usage and 'heat' at the OS level rather than minimize it.
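
(For what it's worth, the "two bytes here, four bytes there" remark is about wchar_t, whose width the C standard leaves implementation-defined; a tiny, hypothetical check:)

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* MSVC on Windows: 2 (a UTF-16 code unit, which is what the NT kernel and
       the "W" APIs work with).  glibc on Linux: 4 (UTF-32).  The C standard
       leaves the width implementation-defined. */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}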

I know you are a C coder too, and I have to say the use of Unicode in PowerBASIC seems to be considerably cleaner than in C/C++.  For example, I'm sure that you, like I, have a good many of the C runtime functions memorized, such as strcpy(), printf(), strcat(), etc., etc., etc.  It became horrendous to use the tchar.h macros such as _tcscpy(), _ftprintf(), _T("Hello, World!"), TEXT("Hello, World!"), L"Hello, World", etc., etc., etc.  Some of them are so ugly one has to constantly be looking them up.  The only way it was solved was for Microsoft to create a new language (C#) and in that way eliminate compatibility issues with legacy code.  In other words, just start out fresh with everything as a wide char string.
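
As a small illustration of that macro soup (plain <tchar.h> usage, Windows-only, and not code from this thread), here is the same copy written three ways:

#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <tchar.h>

int main(void)
{
    char    a[32];
    wchar_t w[32];
    TCHAR   t[32];

    strcpy(a, "Hello, World!");          /* plain ANSI CRT                     */
    wcscpy(w, L"Hello, World!");         /* explicit wide CRT                  */
    _tcscpy(t, _T("Hello, World!"));     /* expands to strcpy or wcscpy,       */
    _putts(t);                           /* puts or _putws, driven by _UNICODE */
    return 0;
}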

I really think PowerBASIC's implementation is absolutely as good as it can be without totally abandoning the language and starting out with a new one.

José Roca

 
I have decided to do the same as Microsoft: keep the ANSI stuff and, instead of duplicating the code of the wrappers, have the ANSI wrapper functions convert strings to Unicode and call the Unicode wrapper function. There is no advantage in duplicating the code because Windows will convert it to Unicode anyway, so by using the "A" functions what you get is more memory usage and more overhead, contrary to what seems to be a popular belief.

This is not the same as using ANSI or Unicode in the internal code of your application. If you need to use an array of strings, ANSI strings will use less memory. But when calling the Windows API, the result is as if, when you use DIM myArray(10) AS STRING, the PB compiler converted it to DIM myArray(10) AS WSTRING. This is more or less what Visual Basic does, isn't it?

Frederick J. Harris

#38
In terms of old 'unmanaged' pre .NET Visual Basic, this is what Bruce McKinney says in his good article "Strings The OLE Way"

Quote
Visual Basic—The designers had to make some tough decisions about how they would represent strings internally. They might have chosen ANSI, because it's the common subset of Windows 95 and Windows NT, and converted to Unicode whenever they needed to deal with OLE. But since Visual Basic 4.0 is OLE inside and out, they chose Unicode as the internal format, despite potential incompatibilities with Windows 95. The Unicode choice caused many problems and inefficiencies both for the developers of Visual Basic and for Visual Basic developers—but the alternative would have been worse.

Link to above...

http://social.msdn.microsoft.com/Forums/en-US/vblanguage/thread/88f6f6ce-46cb-4d19-8b7b-c92f5f34775c

José Roca

This is what I mean:


#INCLUDE ONCE "windows.inc"
#INCLUDE ONCE "commdlg.inc"

' ========================================================================================
' Open File Dialog
' ========================================================================================
FUNCTION AfxOpenFileDialogW ( _
  BYVAL hwnd AS DWORD _                         ' // Parent window
, BYVAL bstrCaption AS WSTRING _                ' // Caption
, BYREF bstrFileSpec AS WSTRING _               ' // Filename
, BYVAL bstrInitialDir AS WSTRING _             ' // Start directory
, BYVAL bstrFilter AS WSTRING _                 ' // Filename filter
, BYVAL bstrDefExtension AS WSTRING _           ' // Default extension
, BYREF dwFlags AS DWORD _                      ' // Flags
) COMMON AS LONG

  LOCAL ix AS LONG
  LOCAL ofn AS OPENFILENAMEW
  LOCAL wszFileTitle AS WSTRINGZ * %MAX_PATH

  ' // Filter is a sequence of ASCIIZ strings with a final (extra) $NUL terminator
  REPLACE "|" WITH $NUL IN bstrFilter
  bstrFilter += $$NUL

  IF LEN(bstrInitialDir) = 0 THEN bstrInitialDir = CURDIR$

  ix = INSTR(bstrFileSpec, $NUL)
  IF ix THEN
     bstrFileSpec = LEFT$(bstrFileSpec, ix) & SPACE$(%OFN_FILEBUFFERSIZE - ix)
  ELSE
     bstrFileSpec = bstrFileSpec & $NUL & SPACE$(%OFN_FILEBUFFERSIZE - (LEN(bstrFileSpec) + 1))
  END IF

  ofn.lStructSize      = SIZEOF(ofn)
  ofn.hwndOwner        = hwnd
  ofn.lpstrFilter      = STRPTR(bstrFilter)
  ofn.nFilterIndex     = 1
  ofn.lpstrFile        = STRPTR(bstrFileSpec)
  ofn.nMaxFile         = LEN(bstrFileSpec)
  ofn.lpstrFileTitle   = VARPTR(wszFileTitle)
  ofn.nMaxFileTitle    = SIZEOF(wszFileTitle)
  ofn.lpstrInitialDir = STRPTR(bstrInitialDir)
  IF LEN(bstrCaption) THEN
     ofn.lpstrTitle    = STRPTR(bstrCaption)
  END IF
  ofn.Flags            = dwFlags
  IF LEN(bstrDefExtension) THEN
     ofn.lpstrDefExt   = STRPTR(bstrDefExtension)
  END IF

  FUNCTION = GetOpenFilenameW(ofn)

  ix = INSTR(bstrFileSpec, $NUL & $NUL)
  IF ix THEN
     bstrFileSpec = LEFT$(bstrFileSpec, ix - 1)
  ELSE
     ix = INSTR(bstrFileSpec, $NUL)
     IF ix THEN
        bstrFileSpec = LEFT$(bstrFileSpec, ix - 1)
     ELSE
        bstrFileSpec = ""
     END IF
  END IF

  dwFlags = ofn.Flags

END FUNCTION
' ========================================================================================

' ========================================================================================
FUNCTION AfxOpenFileDialogA ( _
  BYVAL hwnd AS DWORD _                         ' // Parent window
, BYVAL strCaption AS STRING _                  ' // Caption
, BYREF strFileSpec AS STRING _                 ' // Filename
, BYVAL strInitialDir AS STRING _               ' // Start directory
, BYVAL strFilter AS STRING _                   ' // Filename filter
, BYVAL strDefExtension AS STRING _             ' // Default extension
, BYREF dwFlags AS DWORD _                      ' // Flags
) COMMON AS LONG

  LOCAL bstrFileSpec AS WSTRING
  bstrFileSpec = strFileSpec
  FUNCTION = AfxOpenFileDialogW(hwnd, BYCOPY strCaption, bstrFileSpec, BYCOPY strInitialDir, _
             BYCOPY strFilter, BYCOPY strDefExtension, dwFlags)
  strFileSpec = bstrFileSpec

END FUNCTION
' ========================================================================================


AfxOpenFileDialogW is the one that does the work. AfxOpenFileDialogA simply translates the input string parameters by passing them BYCOPY (PB does the translation under the hood), while the input/output parameter is first translated to Unicode with bstrFileSpec = strFileSpec and back to ANSI with strFileSpec = bstrFileSpec (again, PB does the translations automatically).

If there were a full GetOpenFilenameA implementation, this would be inefficient; but as GetOpenFilenameA does the same thing as AfxOpenFileDialogA, what is really inefficient is to call GetOpenFilenameA instead of GetOpenFilenameW.

Even those who don't need Unicode should use the "W" API functions, for efficiency reasons.
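
For comparison, here is a bare-bones C sketch of going straight to the wide dialog, roughly the C equivalent of AfxOpenFileDialogW; error handling is trimmed and it links against comdlg32:

#include <windows.h>
#include <commdlg.h>

int main(void)
{
    WCHAR file[MAX_PATH] = L"";
    OPENFILENAMEW ofn = {0};

    ofn.lStructSize = sizeof(ofn);
    ofn.lpstrFilter = L"All files\0*.*\0";  /* description/pattern pairs ending
                                               with a double NUL                */
    ofn.lpstrFile   = file;
    ofn.nMaxFile    = MAX_PATH;
    ofn.Flags       = OFN_FILEMUSTEXIST;

    if (GetOpenFileNameW(&ofn))             /* wide all the way: no A->W copies */
        MessageBoxW(NULL, file, L"Selected", MB_OK);
    return 0;
}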

José Roca

Quote from: Frederick J. Harris on April 20, 2011, 09:29:06 PM
In terms of old 'unmanaged' pre .NET Visual Basic, this is what Bruce McKinney says in his good article "Strings The OLE Way"

Quote
Visual Basic—The designers had to make some tough decisions about how they would represent strings internally. They might have chosen ANSI, because it's the common subset of Windows 95 and Windows NT, and converted to Unicode whenever they needed to deal with OLE. But since Visual Basic 4.0 is OLE inside and out, they chose Unicode as the internal format, despite potential incompatibilities with Windows 95. The Unicode choice caused many problems and inefficiencies both for the developers of Visual Basic and for Visual Basic developers—but the alternative would have been worse.

Link to above...

http://social.msdn.microsoft.com/Forums/en-US/vblanguage/thread/88f6f6ce-46cb-4d19-8b7b-c92f5f34775c


So it seems that it is the same decision that I'm making :)

Frederick J. Harris

Quote
This is not the same as using ANSI or Unicode in the internal code of your application. If you need to use an array of strings, ANSI strings will use less memory. But when calling the Windows API, the result is as if, when you use DIM myArray(10) AS STRING, the PB compiler converted it to DIM myArray(10) AS WSTRING.

Even if one is using ANSI strings internally, and not explicitly calling any Windows API functions (no Win32 API even #included), it's likely the compiler itself will be calling API functions on your behalf for whatever your app is doing with those ANSI strings.  And one wouldn't know about it.  The bottom line is that the operating system itself uses wide character strings.  At least that's my bottom line. 

Frederick J. Harris

For example, if you do this...

Local strPath As String
strPath = Curdir$

It might look on the surface like no API functions are being called, but it's almost certain the compiler is calling one.  And at that point you are back to allocating buffers, translating between ANSI and wide, etc.
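
A hedged guess at the shape of that hidden work, purely as an illustration (this is not PowerBASIC's actual runtime code, and curdir_ansi is a made-up name):

#include <stdio.h>
#include <windows.h>

/* Roughly the round trip a runtime has to make to hand back an ANSI current
   directory: ask the (wide) OS, then translate the result. */
static int curdir_ansi(char *out, int cb)
{
    WCHAR wide[MAX_PATH];

    if (!GetCurrentDirectoryW(MAX_PATH, wide)) return 0;          /* the API is wide */
    return WideCharToMultiByte(CP_ACP, 0, wide, -1, out, cb,      /* back to ANSI    */
                               NULL, NULL);
}

int main(void)
{
    char path[MAX_PATH];
    if (curdir_ansi(path, MAX_PATH)) puts(path);
    return 0;
}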

Jeff Blakeney

Quote from: Frederick J. Harris on April 20, 2011, 04:34:49 PM
I fought the good fight for many years against wide character strings too, Jeff.  And it was a hard fight, and it lasted for many years (since around 2000, I'd say).  Then, a year or two ago, I finally decided to surrender.  It's a fight that can't be won.  Of course my battles were all waged on C and C++ battlegrounds and mostly coding for Windows CE.  But I've surrendered and life is better now.

Well, you may have less conflict with supporters of Unicode but that doesn't mean things actually got better.  :)

Quote
The main issue for me isn't international support.  Switching to Unicode (I actually prefer the term 'wide character string', but it's longer) has merit simply based on the idea that, since Windows NT in the early 90s, Windows operating systems have internally worked with the two-byte character set.  I hear that in the Linux world four bytes are used for characters.  In any case, I've always suspected what José described above, where the use of ANSI would likely increase memory usage and 'heat' at the OS level rather than minimize it.

Like I said earlier, I have no control over what Windows does.  An API call could sit and do nothing for 2 seconds before actually doing something, it could make 10 copies of the data I give it, it could translate my English language strings into Latin.  I have no idea what it does, nor do I have any control over it.  As long as I get back what I need in the format I need it, I'm fine with that.  Microsoft could change things again tomorrow so that it no longer uses Unicode, which could mess up the potential benefit of calling the API using Unicode as well.  I say "potential" because, as I said, I don't know the internal workings of Windows and can't say for sure that it does things using Unicode or that the ANSI API statements are just wrappers for the Unicode versions.  It is probably documented somewhere but I've never looked it up.

Quote
I know you are a C coder too, and I have to say the use of Unicode in PowerBASIC seems to be considerably cleaner than in C/C++.  For example, I'm sure that you, like I, have a good many of the C runtime functions memorized, such as strcpy(), printf(), strcat(), etc., etc., etc.  It became horrendous to use the tchar.h macros such as _tcscpy(), _ftprintf(), _T("Hello, World!"), TEXT("Hello, World!"), L"Hello, World", etc., etc., etc.  Some of them are so ugly one has to constantly be looking them up.  The only way it was solved was for Microsoft to create a new language (C#) and in that way eliminate compatibility issues with legacy code.  In other words, just start out fresh with everything as a wide char string.

Actually, I'm not a C programmer.  I learned C in college after they taught us 8086 assembly language, because they felt it was easier for people to learn assembly than to learn C, and I tend to agree with them.  :)  I used BASIC and assembly for all my Apple II programming and didn't really start programming PCs until my brother got PowerBASIC for DOS and contracted me to do some work for him.  I can translate C code if needed, but only with the help of Google searches to remind me what all those cryptic symbols mean.  I certainly don't program anything from scratch in C.

Quote
I really think PowerBASIC's implementation is absolutely as good as it can be without totally abandoning the language and starting out with a new one.

I agree. I think PB has pretty much seamlessly added support for Unicode and, as I said earlier, I'm glad it's there for when I might need it.  At present, I'm a hobby programmer and write stuff for myself, and I have no need for more than 7-bit ASCII, so 8-bit characters are fine for me.  I'm hoping to write some code to share or sell at some point, and I'll most likely need to add Unicode support then, so it is nice to know it is going to be easy to add.

Theo Gottwald

Quote
Visual Basic—The designers had to make some tough decisions about how they would represent strings internally. They might have chosen ANSI, because it's the common subset of Windows 95 and Windows NT, and converted to Unicode whenever they needed to deal with OLE. But since Visual Basic 4.0 is OLE inside and out, they chose Unicode as the internal format, despite potential incompatibilities with Windows 95. The Unicode choice caused many problems and inefficiencies both for the developers of Visual Basic and for Visual Basic developers—but the alternative would have been worse.

Seen like this, the new "AS WSTRING" should make it even easier to call a PB DLL from VB because no conversion is needed any more - any experts on VB here?
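
For reference, a minimal C sketch of the string type VB works with internally, the OLE BSTR (the assumption here, not stated in this thread, is that a dynamic WSTRING maps onto the same kind of OLE string, which is why no conversion would be needed); it links against oleaut32:

#include <stdio.h>
#include <wchar.h>
#include <windows.h>
#include <oleauto.h>

int main(void)
{
    /* VB (and COM in general) keeps its strings as BSTRs: length-prefixed,
       UTF-16, allocated by the OLE automation allocator. */
    BSTR s = SysAllocString(L"Hello from a BSTR");

    if (!s) return 1;
    wprintf(L"%ls (%u characters)\n", s, SysStringLen(s));
    SysFreeString(s);
    return 0;
}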