Internationalisation

Internationalisation often requires support for strings longer than the 255 characters supported by counted strings in the 8-bit character set used by VFX Forth during application development. Such strings may also use a character set or character size different from that used by the application developer.

Internationalisation often requires third parties to be able to convert text strings without having to recompile the application.

Forth system developers and vendors need to make their systems compatible with their clients' existing approaches to internationalisation.

This implementation supports all these requirements, and is a compatible superset of the current ANS Forth Internationalisation proposals, which are available from the downloads section of the MPE web site at: http://www.mpeforth.com

If you are using this software with MPE's VFX Forth system, the source code is in the file Lib\International.fth.

MPE acknowledges the help and support of Construction Computer Software, Cape Town, South Africa, in the design of this software. The CCS application has been internationalised for many years, and their experience has been invaluable, both in defining the Forth 2012 standard and in developing this code.

Long string parsing support

: parse/l       \ char -- c-addr len ; like PARSE over lines
Parse the next token from the terminal input buffer using <char> as the delimiter. The text up to the delimiter is returned as a c-addr u string. PARSE/L does not skip leading delimiters. In order to support long strings, PARSE/L can operate over multiple lines of input and line terminators are not included in the text. The string returned by PARSE/L remains in a single global buffer until the next invocation of PARSE/L. PARSE/L is designed for use at compile time and is not thread-safe or winproc-safe.
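
As a rough sketch of how PARSE/L might be used, the word NOTE: below collects text up to a closing | character, possibly spread over several source lines, and displays it. The name NOTE: and the choice of | as the delimiter are illustrative assumptions and are not part of the library.

```forth
\ Hypothetical helper; requires PARSE/L from Lib\International.fth.
: note:  ( "text|" -- )
  [char] | parse/l      \ -- c-addr len ; text may span several lines
  type cr               \ use the text before PARSE/L is called again
;

note: This text starts on one line
and continues on the next.|
```

Because the result lives in a single global buffer, the sketch consumes the string immediately rather than keeping its address.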

Data structures

Rationale

Although internationalised strings may be referenced by the addresses of suitable data structures, these addresses will change from build to build of the application. The implementation here permits strings to be given a number which does not change between builds. Together with a compile-time hook which can generate a text file in the development language, application strings can be translated in external text files without rebuilding the application. This is required in situations in which translation is performed locally by dealers or by users themselves.

The /TEXTDEF structure described below permits messages to be accessed either by message number or by the address of the structure.

/TEXTDEF structure

Internationalisation of messages relies on a data structure /TEXTDEF. The /TEXTDEF structure contains a link to the previous TEXTDEF or #TEXTDEF definition, a message identifier which is 0 for non-databased strings in the ISO Latin 1 coding, the address of the text, and the length of the text in bytes. The text is followed by two zero bytes, and the text is long-aligned. The /TEXTDEF structure is a superset of the /ERRDEF structure used for error messages by VFX Forth.

The words #TEXTDEF and ERR$ are DEFERred. #TEXTDEF is used by TEXTDEF. The user can install alternative versions of these words for internationalised applications. In this context, #TEXTDEF and friends can be used as the basis of any text handler that requires translation. Note that #TEXTDEF can be modified so that a message file is produced at compile time, and ERR$ modified so that the message file is accessed at run time. Similarly, providing that the application language is correctly handled, the run time can access translated messages in other languages, character sets and character sizes.

The messages are linked into the same chain as is used for all error strings that can be internationalised. This chain is anchored by the variable TEXTCHAIN.


struct /textdef \ -- len ; DOES NOT include constant definition
  int td.value          \ value that identifies string
  int td.link           \ link to previous TEXTDEF td.link field
  int td.id             \ 0 or message ID
  int td.caddr          \ address of text string
  int td.len            \ length of text string
  int td.lenInline      \ length of inline text string in bytes
end-struct
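
The fields can be read like any other structure fields. The following is a minimal sketch only: the word .TEXTDEF is hypothetical, and it assumes that each INT field is one cell wide (so that @ fetches it) and that the text is 8-bit.

```forth
\ Hypothetical display word; field names are those of /TEXTDEF above.
\ Assumption: INT fields are cell-sized and can be read with @.
: .textdef  ( ^textdef -- )
  dup td.id @ .                   \ message ID, 0 if not databased
  dup td.caddr @  swap td.len @   \ -- c-addr len
  type cr                         \ assumes 8-bit text
;
```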

String structure

Creating and referencing LOCALE strings

In this implementation, the ANS locale string identifier "lsid" is a pointer to a /TEXTDEF structure.

defer l$CompileHook     \ ^textdef --
A DEFERred hook that the user can modify to produce additional data at compile time. For example, the hook is commonly replaced by code that generates a text file in the development language. This text file then serves as the basis for translation to other languages.
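
A hook of this kind might look like the sketch below, which appends each newly compiled string to a text file. The names TextFile, OpenTextFile, LogText and CloseTextFile and the file name messages.txt are all hypothetical. The sketch also assumes that the hook is called once the /TEXTDEF structure is complete, so that L$COUNT (described below) returns the text.

```forth
\ Hypothetical compile-time logger using the standard File word set.
0 value TextFile                        \ fileid of the open message file

: OpenTextFile  ( -- )
  s" messages.txt" w/o create-file throw  to TextFile ;

: LogText  ( ^textdef -- )
  l$count TextFile write-line throw ;   \ one line per string

: CloseTextFile  ( -- )
  TextFile close-file throw ;

OpenTextFile
' LogText is l$CompileHook              \ strings compiled from here on are logged
```

CloseTextFile would be executed at the end of the build. A real hook would probably also write the message number so that translators can key on it.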

: L$",          \ n -- ; compile a long string
This can be thought of as a multiline version of ",. First a /TEXTDEF structure is created. Then it collects multiline text and lays down an inline string with two zero bytes as termination. The start of the string is aligned on a four-byte boundary. The end of the string is padded to a four-byte boundary.

defer #TEXTDef  \ n -- ; -- n
Define a constant and associated message in the form: <n> #TEXTDEF <name> "<text>". Execution of <name> returns <n>.

: NextText      \ -- addr
Returns the address of the variable holding the next constant used to identify an internationalised string.

: NextText#     \ -- n
Return the contents of NEXTTEXT and increment NEXTTEXT.

: TextDef       \ -- ; -- n ; used as throw/error codes
Define a constant and associated message in the form: TEXTDEF <name> "<text>". Execution of <name> returns the constant automatically allocated by NEXTTEXT#.

: l$find        \ n -- struct|0 ; produce pointer to TEXTDEF structure
Given a message number n, return the address of the /TEXTDEF structure containing its details, or 0 if no structure is found for that number.

: l$count       \ lsid -- c-addr u
Given a /TEXTDEF structure, the address and length in bytes of the text string are returned.

: l$addr        \ lsid -- c-addr
Given a /TEXTDEF structure, the address of the text string is returned.
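
The defining and access words above can be combined as in the following sketch. The message names, the message text, the number 5001 and the word .MESSAGE are hypothetical.

```forth
\ Requires Lib\International.fth.
5001 #textdef MsgSaved "File saved"     \ explicit number, stable across builds
textdef MsgFailed "Save failed"         \ number allocated by NEXTTEXT#

: .message  ( n -- )
  l$find ?dup if                \ -- lsid ; 0 means no such message
    l$count type
  else
    ." (unknown message)"
  then  cr
;

MsgSaved .message
MsgFailed .message
```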

: (l$")         \ -- lsid
The runtime action of L$" to return the address of the /TEXTDEF structure associated with the string compiled by L$".

: L$"           \ -- ; -- lsid
Used inside a colon definition to compile a string that will be internationalised. At run time the address of the /TEXTDEF structure will be returned.

: LS"           \ -- ; -- caddr u
Used to compile or extract a long string. When used during compilation, L$", is used to lay down a string for internationalisation. At run time the address and length of the string are returned.

: ZLS"          \ -- ; -- c-addr
Used to compile or extract a zero terminated long string. When used during compilation, L$", is used to lay down a string for internationalisation. At run time the address of the string is returned.
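
A short sketch of these words inside colon definitions follows. The word names and the text are hypothetical, and the text is assumed to be 8-bit so that TYPE can display it.

```forth
\ Requires Lib\International.fth.
: .welcome  ( -- )
  ls" Welcome to the application"   \ -- caddr u
  type cr
;

: .goodbye  ( -- )
  l$" Goodbye"                      \ -- lsid
  l$count type cr                   \ extract the text from the structure
;
```

LS" suits code that only needs the text, while L$" keeps the lsid available for words that take a /TEXTDEF address.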

ANS LOCALE word set

In this implementation, the ANS locale string identifier "lsid" is a pointer to a /TEXTDEF structure.

defer set-language      \ lang -- ior
Set the current language code. At the very least, the action of this word must be to set the variable <LANGUAGE>. The action may also include updating the string data in the TD.CADDR and TD.LEN fields of all the /TEXTDEF and /ERRDEF structures. If the operation succeeds, the returned ior is 0. If the operation fails, the returned ior is non-zero and the meaning of the ior is implementation dependent.

: get-language  \ -- lang
Return the current language code.

defer set-country       \ country -- ior
Set the current country code. At the very least, the action of this word must be to set the variable <COUNTRY>. The action may also include updating locale-sensitive routines such as date and time display formatting words. If the operation succeeds, the returned ior is 0. If the operation fails, the returned ior is non-zero and the meaning of the ior is implementation dependent.

: get-country   \ -- country
Return the current country code.
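
As an illustration only, the sketch below selects French and reports a failure. It assumes that language codes are Windows language identifiers built with LANGID (described under Windows language support) and that the LANG_FRENCH and SUBLANG_FRENCH constants are visible. The word UseFrench is hypothetical.

```forth
\ Hypothetical word; assumes Windows language IDs are used as language codes.
: UseFrench  ( -- )
  LANG_FRENCH SUBLANG_FRENCH langid   \ -- lang
  set-language                        \ -- ior ; 0 on success
  abort" Cannot select French"
;

UseFrench
get-language .                        \ display the code now in force
```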

: l"            \ -- ; -- lsid ; L" <native text>"
A locale-sensitive version of C" which returns an lsid (string identifier) at run-time. The native text may be compiled inline.

Interpretation: The interpretation semantics for this word are undefined.

Compilation: \ "ccc<quote>" -- Parse ccc delimited by a " (double-quote) and append the run-time semantics given below to the current definition.

Runtime: \ -- lsid Return lsid, an identifier for a locale string. Other words use lsid to extract language specific information.

: LOCALE@       \ lsid -- addr len(au)
Return the address and length in address units of the string (in the current language) that corresponds to the native string identified by lsid. The format of the string at addr is implementation dependent. The length of the string is returned in address units so that it may be copied by MOVE without knowledge of the character set width.

Text macro substitution is performed by the Forth 2012 word SUBSTITUTE.

: substitute    \ src slen dest dlen -- dest dlen' n ; 17.6.2.2255
Expand the source string using text macro substitutions, placing the result in the buffer dest/dlen and returning the destination string dest/dlen' and the number n of substitutions made. If an error occurred, n is negative. Ambiguous conditions occur if the result of a substitution is too long to fit into the given buffer or the source and destination buffers are the same.

Substitution occurs left to right from the start of src/slen in one pass and is non-recursive. When text of a potential substitution name, surrounded by ‘%’ (ASCII $25) delimiters is encountered by SUBSTITUTE, the following occurs:
a) If the name is null, a single delimiter character is passed to the output, i.e., %% is replaced by %. The current number of substitutions is not changed.
b) If the text is a valid substitution name, the leading and trailing delimiter characters and the enclosed substitution name are replaced by the substitution text. The current number of substitutions is incremented.
c) If the text is not a valid substitution name, the name with leading and trailing delimiters is passed unchanged to the output. The current number of substitutions is not changed.
d) Parsing of the input string resumes after the trailing delimiter.
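
The rules above can be tried with the Forth 2012 word REPLACES, which defines a substitution. This is a sketch under stated assumptions: the macro name user, the buffer OutBuf and the words SetUser, Expand and DEMO are hypothetical, and the system is assumed to provide REPLACES and SUBSTITUTE.

```forth
create OutBuf 128 chars allot         \ destination buffer for the expansion

: SetUser  ( -- )
  s" Fred" s" user" replaces ;        \ %user% now expands to Fred

: Expand  ( c-addr u -- c-addr' u' )
  OutBuf 128 substitute               \ -- dest dlen' n
  0< abort" SUBSTITUTE failed" ;      \ negative n signals an error

: demo  ( -- )
  SetUser
  s" Hi %user%, 100%% done, %nosuch% stays" Expand type cr ;
```

Following rules a) to c), DEMO should display: Hi Fred, 100% done, %nosuch% stays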

The Forth 2012 standard contains a reference implementation for SUBSTITUTE and its friends REPLACES and UNESCAPE.

ANS LOCALE extension word set

In this implementation, the ANS locale string identifier "lsid" is a pointer to a /TEXTDEF structure.

defer LOCALE-INDEX      \ lsid --
Updates the internal data structure. Useful if structures are added and changes to internal structures are required.

: LOCALE-LINK   \ lsid1 -- lsid2
Given the address of one LOCALE structure, returns the address of the next.

defer LOCALE-TYPE       \ addr len --
Displays the LOCALE string whose address and length in address units are given.

: NATIVE@       \ lsid -- c-addr len
Given a LOCALE structure, returns the address and length of the corresponding native string, in the development character set (DCS), that was compiled by L".
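
A minimal sketch of how these words fit together with L" follows. The word names and the text are hypothetical, and the native text is assumed to be 8-bit so that TYPE can display it.

```forth
\ Requires Lib\International.fth.
: .welcome  ( -- )
  l" Welcome"       \ -- lsid
  locale@           \ -- addr len(au) ; text in the current language
  locale-type cr
;

: .native  ( -- )
  l" Welcome"       \ -- lsid
  native@           \ -- c-addr len ; text as written by the developer
  type cr
;
```

LOCALE-TYPE is used for the translated text because its length is in address units and its character size may differ from the native one.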

Windows language support

Windows contains a large number of predefined language constants of the form LANG_xxx and SUBLANG_xxx. A Windows locale is identified by merging a pair of these as described below.


  +-------------------------+-------------------------+
  |     SubLanguage ID      |   Primary Language ID   |
  +-------------------------+-------------------------+
  15                    10  9                         0   bit

These constants can be viewed from VFX Forth by using:

  SIM LANG_
  SIM SUBLANG_

These codes use 0 as the current or neutral code. This matches the use of 0 as the language code for the development character set, which is ISO Latin 1 for VFX Forth. In this set, the seven-bit ASCII character set defined by ANS Forth represents characters 0..127.

: langID        \ primary secondary -- langid
Generate a Windows language code from the primary and secondary codes, e.g.

  LANG_SPANISH SUBLANG_SPANISH_MEXICAN langid
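
The merge shown in the bit layout above can be sketched in standard Forth as follows. The word my-langid is a hypothetical stand-in and is not the library's definition. The numeric values used are the Windows values of LANG_SPANISH ($0A) and SUBLANG_SPANISH_MEXICAN ($02).

```forth
\ Hypothetical equivalent of LANGID: sublanguage in bits 10..15,
\ primary language in bits 0..9.
: my-langid  ( primary secondary -- langid )
  10 lshift or ;

$0A $02 my-langid .     \ displays 2058 in decimal, which is $080A
```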