VFX Code Generator

The VFX code generator is a black box that simply does its job. Some implementations may have switches for special cases.

Enabling the VFX optimiser

The optimiser can be enabled and disabled by the words OPTIMISED and UNOPTIMISED. The state of the optimiser can be detected with the word OPTIMISING? ( -- flag ).

Binary inlining

Binary inlining consists of copying the binary code for a word inline without the final return instruction. This avoids the overhead of the call and return instructions. It is useful for very short coded instruction sequences. For high level definitions the source inliner usually gives better results.

The VFX code generator gives some control over the use of binary inlining, controlled by the word INLINING ( n -- ). When the code generator has completed a word, the length of the word is stored. When the word is to be compiled, its length is compared against the value passed to INLINING, and if the length is less than the system value, the word is compiled inline, with the procedure entry and exit code removed. This avoids pipeline stalls, and is very useful for short definitions.

By default four constants are available for inlining control, although any number will be accepted by INLINING.


 NO INLINING               \ 0, binary inlining turned off
 NORMAL INLINING           \ 12-16, ~10% increase in size
 AGGRESSIVE INLINING       \ 255, useful when time critical
 ABSURD INLINING           \ 4096, unlikely to be useful

You can use INLINING anywhere in the code outside a definition.
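
As a sketch of how the limit might be varied, the fragment below raises it for two time-critical words and then selects NORMAL again. The words SQ and SUMSQ are hypothetical, and NORMAL is assumed to have been the previous setting.

AGGRESSIVE INLINING             \ words shorter than the larger limit may now be inlined
: sq      \ n -- n*n
  dup *
;
: sumsq   \ a b -- a*a+b*b
  sq swap sq +                  \ SQ is short, so it is a candidate for inlining here
;
NORMAL INLINING                 \ assumed previous setting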

Colon definitions

Any word that uses words that affect the return stack such as EXIT, or takes items off the return stack that you didn't put there in the same word, will automatically be marked as not being able to be inlined.

Implementations that use absolute calls will disable inlining of any word that makes an absolute call.

Note that inlining cascades: a word that contains inlined words can itself be inlined, so the resulting code size may not be what you expect.


: A ... ;                  \ inlined
: B ... A ... ;            \ A inlined, B can be inlined
: C ... B ... B ... ;      \ A, B inlined, C can be inlined

Code definitions

By default CODE definitions are not marked for inlining because the assembler cannot detect all cases which may upset the return stack. If you want to make a code definition available for binary inlining, follow it with the word INLINE.


   CODE <name>
     ...
   END-CODE InLine

VFX Optimiser Switches

Some instructions are only available on later CPUs. Note that CPU selection affects the assembler and the VFX code generator at compile time, not the run time instruction usage of your application. If you select a higher CPU level than the application runs on, incorrect operation will occur. The default selection is for the Pentium 4 instruction set.

 CPU=386     \ -- ; select base instruction set
 CPU=PPro    \ -- ; Pentium Pro and above with CMOVcc
 CPU=P4      \ -- ; Pentium 4 and above

Aspects of the VFX code generator are controllable by switches. In particular the inlining of the DO ... LOOP entry code and local variable entry code may be turned on and off to suit your particular coding style.

Note also that for large, computationally intensive definitions, SMALLER may actually give better performance than FASTER. The impact of these switches varies considerably between CPU types and cache/memory architecture.

#16 value /code-alignment       \ -- n
The default code alignment used by FASTER below. Must be a power of two.

: smaller       \ --
Selects smaller code using the minimum of alignment.

: faster        \ --
Selects faster code using 16 byte alignment, which will increase the size of the dictionary headers.

: +polite       \ -- ; suppresses some warnings
Suppresses some warning messages which some users may feel are commenting on their code. In particular, if you define constants to enable and disable code without using conditional compilation, you can use +POLITE to disable the warnings about conditional branches against a constant. See also -POLITE.

: -polite       \ -- ; enables some warnings
Enables some warning messages which warn you if you have used a phrase such as "<literal> IF". See +POLITE.

0 value MustLoad?       \ -- n
Returns true if indirect accesses are loaded rather than delayed.

: +MustLoad     \ --
Forces indirect memory loads to be fetched into a register rather than delayed. For some applications (mostly calculations with array indexing) this can lead to a performance gain.

: -MustLoad     \ --
Permits indirect memory loads to be delayed. This is the default condition.
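
The fragment below is a sketch of how the switch might be applied to a single array-indexing word. DOT3 is a hypothetical word that forms the dot product of two three-cell vectors. Whether +MustLoad helps in a given case is best determined by measurement.

+MustLoad                       \ fetch indirect loads into registers
: dot3    \ a1 a2 -- n
  0                             \ a1 a2 acc
  3 0 do
    2 pick i cells + @          \ a1 a2 acc x
    2 pick i cells + @          \ a1 a2 acc x y
    * +                         \ a1 a2 acc'
  loop
  nip nip                       \ leave the sum only
;
-MustLoad                       \ back to the default condition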

: +short-branches       \ --
Enables the VFX optimiser to produce short forward branches. If your code causes a branch limit to be exceeded, you can put -SHORT-BRANCHES and +SHORT-BRANCHES around the offending words. By default, short branch generation is off because this gives better performance on modern CPUs.

: -short-branches       \ --
Prevents the VFX optimiser producing short forward branches. By default, short branch generation is off.

: short-branches?       \ -- flag ; true for short branches
Returns true if the optimiser will produce short forward branches.

: [-short-branches      \ -- sys
Disables short branch optimisation until the previous state is restored by SHORT-BRANCHES].

: [+short-branches      \ -- sys
Enables short branch optimisation until the previous state is restored by SHORT-BRANCHES].

: short-branches]       \ sys --
Restores the short branch optimisation state previously saved by [+SHORT-BRANCHES or [-SHORT-BRANCHES.
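
The bracketing words might be used as sketched below. ABSDIFF is a hypothetical word chosen because it contains one short forward branch.

[+short-branches                \ save the current setting, enable short branches
: absdiff   \ a b -- |a-b|
  - dup 0< if  negate  then
;
short-branches]                 \ restore the saved setting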

: LoopAlignment \ n --
Set loop starts, e.g. BEGIN..XXX and DO..LOOP, to be aligned on an n-byte boundary, where n must be a power of two. This is useful to force the heads of loops onto a cache line boundary. The default is 8.


  #16 LoopAlignment     \ set to 16 byte boundary
  0   LoopAlignment     \ revert to lowest setting

: +fastlvs      \ --
Enables generation of inline local variable entry code. This is the default condition, and is strongly recommended.

: -fastlvs      \ --
Disables generation of inline local variable entry code.

Most modern x86 operating systems use task gates for interrupt handling, which permits some code generation to be better, especially for local variables.

SafeOS? value SafeOS?   \ -- flag
Returns true if the operating system can be assumed to be safe.

: +SafeOS       \ --
Assume a safe modern operating system.

: -SafeOS       \ --
Assume an old-fashioned or raw operating system.

Controlling and Analysing compiled code

These directives control the optimiser.

: optimising?   \ -- flag
Returns true if the optimiser is enabled.

: optimised     \ -- ; turn optimisation on
Enables the optimiser.

: unoptimised   \ -- ; turn optimisation off
Disables the optimiser.

These directives are used to turn optimisation on and off around sections of code.

: [opt          \ -- i*x
Save the current state of optimisation at the start of an [OPT ... OPT] structure. You can make no assumptions about what the data stack contains.

: [-opt         \ -- i*x
Save the current state of optimisation at the start of an [-OPT ... OPT] structure and turn optimisation off.

: opt]          \ i*x --
Restore the state of optimisation at the end of an [OPT ... OPT] structure to what it was at the start.
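
A sketch of the structure in use follows. SCALE is a hypothetical word, and DIS is used afterwards to look at the code produced with the optimiser off.

[-opt                           \ save the optimiser state, then turn it off
: scale   \ n -- n'
  2* 1+
;
opt]                            \ restore the saved state
dis scale                       \ inspect the unoptimised code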

The following directives are IMMEDIATE words that you can put inside your definitions to obtain an idea of how code is being compiled. DIS <name> will disassemble a word.

: []            \ --
Lay a NOP instruction as a marker, without flushing the optimiser.

: [o/f]         \ --
Flush the optimiser state, generating the canonical stack state again with TOS in the EBX register, and all other stack items in the deep (memory) stack.

: [o/s]         \ --
Show the state of the optimiser's working stack.
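
As an illustration, [O/S] can be placed part way through a definition. PROBE is a hypothetical word. Because [O/S] is IMMEDIATE, the report appears while PROBE is being compiled, not when it runs.

: probe   \ a b -- a+b
  over +  [o/s]                 \ show the working stack at this point
  nip
;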

Hints and Tips

On i32/x86 Pentium-class CPUs the PUSH and POP instructions generated by >R and R> are slow, and the VFX code generator is quite conservative in optimising return stack manipulations as compared with data stack manipulations. Although the code below is convenient, safe and easy to write, it is slow. The rect.xxx words are fields in a structure.


: Rect@   \ rect -- l t r b
\ Retrieve the values l t r b from the RECT[ structure at
\ the address given.
  >r
  r@ rect.Left @
  r@ rect.Top @
  r@ rect.right @
  r> rect.bottom @
;

The version below generates far better code when performance is important.


: Rect@   \ rect -- l t r b
\ Retrieve the values l t r b from the RECT[ structure at
\ the address given.
  dup rect.Left @ swap
  dup rect.Top @ swap
  dup rect.right @ swap
  rect.bottom @
;

Because of the limited number of registers, better code is usually generated by passing a pointer to a structure such as a rectangle rather than passing four items on the data stack. Use of words such as Rect@ should be reserved for preparing parameters for a Windows API call.
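
A sketch of the pointer-passing style follows. RECTWIDTH is a hypothetical word. It takes the address of the structure and fetches only the fields it needs, using the same rect.xxx field words as above.

: RectWidth   \ rect -- width
  dup rect.right @  swap rect.Left @  -
;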

VFX Forth v4.x

If you have written custom optimisers, the EAX register is no longer free for use, but must be requested like any other working register. CODE definitions require no changes.

Tokeniser

From VFX Forth v4.3, build 2825, the tokeniser replaces the previous source inliner. The change was made to improve ANS and Forth200x standards compliance, and to reduce issues with particularly "guru" code. To prevent breaking your existing code, the tokeniser uses the same word names for its control words.

The tokeniser keeps track of what is compiled for a word, and reruns the compilation of short definitions rather than copying the compiled code inline. This gives the VFX code generator many more opportunities to remove stack operations and produces smaller and faster code while encouraging users to write short definitions. That having been said, the relationship of code size with and without the tokeniser enabled is obscure at best.

Under some rare conditions, usually those requiring tinkering with internal structures of VFX Forth during compilation, it is necessary to have a level of control over the tokeniser. This section documents those words.

Tokeniser state

: discard-sinline       \ --
Stops the current definition from being handled by the tokeniser. This is usually required by a compilation word which generates inline data, and for which repetition of the word containing the inline data would generate large code with little speed advantage.

#128 Value SinThreshold \ -- u
If the binary size of a word is less than this value, it can be tokenised. Subject to change.

: .Tokens       \ xt --
Display the token stream for a word.

: .Tokeniser    \ --
Display tokeniser state.

Tokeniser control

FALSE Value Sin?        \ -- flag
A VALUE which enables tokenising when set. Using SIN? enables you to determine the state of the tokeniser.

false value sindoes?    \ -- flag
A VALUE which enables tokenising of DOES> clauses when set. Using this value enables you to determine the state of the tokeniser.

false value SinActive?  \ -- flag
Returns true when the tokeniser is active. It is used to inhibit some immediate words which must not be rerun when the word they are in is tokenised.

: +sin          \ --
Enable tokenising of following definitions.

: -sin          \ --
Disable tokenising of following definitions.

: +sindoes      \ --
Enable tokenising of the run time portions of defining words. Many defining words produced with CREATE ... DOES> have short run time actions. The address returned by DOES> is a literal and provides many opportunities for both space and speed optimisation.

: -sindoes      \ --
Disable tokenising of the run time portions of defining words.
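
The following sketch shows the kind of defining word that may benefit. OFFSET: and >NEXT are hypothetical names. The final -SINDOES assumes that DOES> tokenising was off beforehand, so adjust it to suit your system.

+sindoes                        \ tokenise DOES> clauses from here on
: offset:   \ n -- ; addr -- addr+n
  create  ,                     \ store the offset in the child
  does>  @ +                    \ short run time action
;
1 cells offset: >next           \ child steps an address on by one cell
-sindoes                        \ assumed previous setting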

: [sin          \ -- i*x
[SIN and SIN] define a range of source code and must be used interpretively, not during compilation. [SIN saves the current tokeniser state and SIN] restores it. Often used in the form:

  [SIN -SIN ... SIN]

: sin]          \ i*x --
See [SIN above.

: [-sin         \ -- i*x
[-SIN saves the current tokeniser state, and turns off the tokeniser. SIN] restores the saved tokeniser state. Used in the form:

  [-SIN ... SIN]

: [+sin         \ -- i*x
[+SIN saves the current tokeniser state, and turns on the tokeniser. SIN] restores the saved tokeniser state. Used in the form:

  [+SIN ... SIN]

: Sinlined?     \ xt -- flag
Return true if the word defined by xt can be compiled by the tokeniser.

: RemoveSin     \ xt --
Remove tokeniser information from a word. If the word has no tokeniser information it is unaffected.

: DoNotSin      \ --
If the last word with a dictionary header must not be tokenised, place DoNotSin after its definition, e.g.

  : foo ... ; DoNotSin

: IMMEDIATE     \ --
Mark the last defined word as immediate. Immediate words will execute whenever encountered regardless of STATE. IMMEDIATE also disables tokenising of the last defined word. In practice, this is not a performance issue as IMMEDIATE words are executed at compile time.

: RemoveSINinRange      \ start end --
Remove all tokeniser information for definitions within the given range.

: RemoveAllSins \ --
Remove all tokeniser data in the system. RemoveAllSins is executed by the exit chain during BYE.

Gotchas

These gotchas are very rare conditions. They usually only appear when you write words that affect the semantics (meaning) of compilation. You can use [-sin ... sin] to drill down to the words that are causing problems.


[-sin
: foo ... ;
: poo ... ;
sin]

Immediate and defining words

The tokeniser hooks into the guts of COMPILE, and LITERAL. Compilation performed through these words is unaffected by the tokeniser.

Tokenising of IMMEDIATE words is disabled to reduce problems with "guru" code. In nearly all cases, these words are only executed at compile time, so there is minimal impact on application performance. If an immediate word causes compilation using COMPILE, and LITERAL, the tokeniser will detect this and generate tokens, e.g.


: z1 postpone dup postpone over ; immediate
: z2 z1 ;
' z2 .tokens
StartToken
DUP
OVER
End Token

In the majority of cases the tokeniser handles defining words quite adequately. In a few cases, such as defining new types of xVALUE, better code generation can be obtained by performing some calculation at compile time. Such defining words should set a compiler for their children.

To do this, use SET-COMPILER and INTERP> rather than DOES>. INTERP> indicates to the compiler that what follows is performed when the child is interpreted and that a compiler for the child has been defined. The following example is the kernel definition of VALUE.


: value         \ n -- ; ??? -- ???
  create
    ,  ['] valComp, set-compiler
  interp>
    valInterp
;

Note that the children of words using INTERP> are not immediate; they have separate interpretation and compilation actions. SET-COMPILER ( xt -- ) above sets valComp, to be the compiler of the last word CREATEd. The compiler is passed the xt of the word it is to compile so that information can be extracted from the word.

There are rare occasions on which you may want to add a compiler to a non-defining word. Rather than making the word immediate and state-smart, which can lead to problems, you can add the compiler yourself. This is especially desirable when the compiler uses carnal knowledge of VFX Forth rather than just COMPILE, and LITERAL. The example is taken from the VFX Forth kernel.


: DO    \ Run: n1|u1 n2|u2 -- ; R: -- loop-sys
  NoInterp  ;
comp: drop  s_do,  3  ;

Return stack modifiers

In nearly all cases, words that modify the return stack will be detected and these words will not be tokenised. However, in some cases words containing such words should not be tokenised either, because the flow of control has been modified. The first example below fails, but the second does not. Note that, according to the ANS and Forth200x standards, these words are non-standard because they make the assumption that, on entry to a word, the top item on the return stack is the return address. The example below is taken from a third-party application ported to VFX Forth.

This example is correctly detected, but fails because the code also requires the word containing LIST> not to be tokenised.


: list> ( thread -- element )
  BEGIN  @ dup  WHILE  dup r@ execute  REPEAT
  drop r> drop ;
...
: .fonts fonts LIST> .font ;

The example above makes two assumptions, one about the return stack in the use of R@ and R>, and another about how colon definitions begin in EXECUTE.

The solution is to disable the tokeniser when the word is compiled. The containing word is forced to be untokenised.


: (list>) ( thread -- element )
  BEGIN  @ dup  WHILE  dup r@ execute  REPEAT
  drop r> drop ;
: list> ( thread -- element )
  postpone (list>) discard-sinline ; immediate
...
: .fonts fonts LIST> .font ;

If you need to write words such as these, partitioning them as above, plus careful use of :NONAME to create the second part improves portability and maintainability.

Using :

If you build a new compiling word that uses colon, :, its children can themselves be tokenised. If your new word saves and restores data from the return stack indirectly, the tokeniser may not detect this, leading to obscure runtime or compilation errors. This situation can be avoided by adding DISCARD-SINLINE after the use of colon, e.g.


: MY:   \ --
  :  postpone save-state  discard-sinline
;

: MY;   \ --
  postpone restore-state  postpone ;
;

Code size

Some coding styles can lead to excessive expansion of code size by the tokeniser. Apart from turning the tokeniser off, you can try reducing the size set in the value SinThreshold. Note that the relationship between the compiled size of a word and its equivalent after token expansion in another word is often obscure.
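
One way of trying a smaller limit for a group of definitions is sketched below. The figure of 64 and the word HELPER are hypothetical. The old limit is kept on the data stack and restored afterwards.

SinThreshold                    \ -- u ; keep the current limit on the stack
#64 to SinThreshold             \ hypothetical smaller limit
: helper   \ n -- n'
  dup * 1+
;
to SinThreshold                 \ restore the original limit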

Code/Data separation

From VFX Forth v4.3 onwards, code/data separation is turned on by default.

Problem and solution

CPUs from the Pentium 3 onwards have serious performance problems when data is close to code, leading to a wide variation in performance depending on data location. Measurements on the random number generator in the benchmark suite had a variation of 7:1.

The file Sources\Kernel\386Com\OPTIMISE\P4opt.fth (with Professional and Mission versions) contains code for data space management for these processors. Results show that performance is improved by a factor of 2.3 on BENCHMRK.FTH and that performance is now independent of location. There is no degradation of performance on other CPUs. The code generation switches are:


  +IDATA        \ -- ; enable code/data separation
  -IDATA        \ -- ; disable code/data separation

Note that when enabled, phrases such as

  VARIABLE <name> <size> ALLOT

will not give the expected result. This is discussed in more detail below.

The solution is to separate code and data. When the optimisation is enabled, data is held in IDATA chunks away from code. There is no change to CREATE, ALLOT, comma and friends, which still operate on normal dictionary areas. The notation is derived from cross compiler usage in embedded systems.

Defining words and data allocation

The following is a conventional definition of a character/byte array defined in the dictionary.


: cCARRAY       \ n -- ; i -- c-addr
  CREATE  ALLOT  DOES> + ;

The data space reserved by ALLOT is intermingled with code, leading to bad performance. The second implementation is for best performance with P4 CPUs. IRESERVE ( n -- c-addr ) reserves an n-byte block in the IDATA area and returns its address. The children of ICARRAY are made immediate in order to emulate the effect of the source inliner on children of CCARRAY. The implementation below is illustrative only. State-smart words (considered "evil" by some) can be avoided using set-compiler and interp>.


: icarray       \ n -- ; i -- c-addr
  dup ireserve  dup rot erase      \ reserve IDATA space
  create  immediate                \ children are IMMEDIATE
    ,                              \ address in IDATA
  does>
    @  state @ if                  \ compiling
      postpone literal  postpone +
    else                           \ interpreting
      +
    endif
;

In order to make the array defining word CARRAY independent of whether P4 optimisation is enabled, CARRAY simply selects which version to use.


: CARRAY        \ n -- ; i -- c-addr
  idata?
  if  icarray  else  ccarray  endif
;

Gotchas

When +IDATA is in use, standard defining words such as VARIABLE and VALUE will reserve space in the IDATA areas, but ALLOT still reserves space in the dictionary. Consequently code such as:

  VARIABLE <name> <size> ALLOT

will break when +IDATA is active. Use:

  <size> BUFFER: <name>

for all such allocations.
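
For example, a 256 byte buffer might be declared and cleared as sketched below. The name LINEBUF and the size are hypothetical.

#256 buffer: linebuf            \ instead of  VARIABLE linebuf ... ALLOT
linebuf #256 erase              \ the whole buffer is one region, with or without +IDATA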

Words such as >BODY and BODY> will not work correctly on words whose data area is in an IDATA region.

Glossary

variable iblock         \ -- addr
Holds the address of the current IDATA block.

variable iblock#        \ -- addr
Holds the size of the current IDATA block.

variable idp            \ -- addr
Holds the current location in the current IDATA block.

variable def-igap       \ -- addr
Holds the minimum code/data gap size, by default 8 kbytes.

variable def-iblock#    \ -- addr
Holds the default IDATA block size, by default 64 kbytes.

: bin-align     \ n --
Force alignment to an N byte boundary where N is a power of two. The space stepped over is set to 0.

: alignidef     \ --
Align the dictionary to the IDATA default boundary.

: inoroom?      \ n -- flag
Returns true if there is not enough room in the current IDATA block.

: make-iblock   \ n --
Make an IDATA block that is at least n bytes long. If n is less than the default size in DEF-IBLOCK# the block will be the default size.

: ialign        \ --
Step the IDATA block pointer to the next 4 byte boundary.

: ialign16      \ --
Step the IDATA block pointer to the next 16 byte boundary.

: ireserve      \ n -- a-addr
Reserve n bytes in the current IDATA block.

0 value idata?          \ -- flag
Returns true if data is reserved in the IDATA block.

: +idata        \ --
Force data to be reserved in IDATA blocks.

: -idata        \ --
Data is reserved conventionally in the normal dictionary space.

: 2variable     \ -- ; -- addr
If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.

: variable      \ -- ; -- addr
If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.

: buffer:       \ size -- ; -- addr
If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.

: value         \ n -- ; -- n
If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.

: 2value                \ x1 x2 -- ; -- x1 x2
If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.

: CARRAY        \ n -- ; i -- c-addr
Creates a byte array. When the child executes, the address of the i'th byte in the array is returned. The index is zero based. If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.


10 CARRAY MYCARRAY              \ create 10 byte array
  5 MYCARRAY .          \ display address of element 5

: ARRAY         \ n -- ; i -- a-addr
Creates a cell size array. When the child executes, the address of the i'th cell in the array is returned. The index is zero based. If IDATA? is true data is reserved in an IDATA block, otherwise it is reserved in the dictionary.


10 ARRAY MYARRAY                \ create 10 cell array
  6 MYARRAY .           \ display address of element 6