The VFX code generator is a black box that simply does its job. Some implementations may have switches for special cases.
The optimiser can be enabled and disabled by the words
OPTIMISED and UNOPTIMISED. The state of the optimiser can be
detected by inspecting the variable OPTIMISING.
Binary inlining consists of copying the binary code for a word inline without the final return instruction. This avoids the overhead of the call and return instructions. It is useful for very short coded instruction sequences. For high level definitions the source inliner usually gives better results.
The VFX code generator gives some control over the use of
binary inlining, controlled by the word INLINING ( n -- ).
When the code generator has completed a word, the length of
the word is stored. When the word is to be compiled, its
length is compared against the value passed to INLINING,
and if the length is less than the system value, the word is
compiled inline, with the procedure entry and exit code
removed. This avoids pipeline stalls, and is very useful
for short definitions.
By default four constants are available for inlining control,
although any number will be accepted by INLINING.
NO INLINING \ 0, binary inlining turned off
NORMAL INLINING \ 12-16, ~10% increase in size
AGGRESSIVE INLINING \ 255, useful when time critical
ABSURD INLINING \ 4096, unlikely to be useful
You can use INLINING anywhere in the code outside a
definition.
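As a minimal sketch of the above, assuming that the setting in force when the calling word is compiled is the one that matters (as the length comparison described earlier suggests), the limit can be raised around a time-critical section and then put back. The words SQ and SUM-SQ are hypothetical, and the sketch assumes NORMAL was the setting beforehand.

```forth
: sq  ( n -- n*n )  dup * ;       \ hypothetical short helper

AGGRESSIVE INLINING               \ raise the limit for the words that follow
: sum-sq  ( a b -- a*a+b*b )      \ hypothetical time-critical caller
  sq swap sq + ;
NORMAL INLINING                   \ assumed previous setting
```

Whether SQ is actually copied inline still depends on its stored length and on the other rules in this section.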
Any word that uses words that affect the return stack such as
EXIT, or takes items off the return stack that you
didn't put there in the same word, will automatically be
marked as not being able to be inlined.
Implementations that use absolute calls will disable inlining of any word that makes an absolute call.
Note that when words are inlined, the effects may not be as expected.
: A ... ; \ inlined
: B ... A ... ; \ A inlined, B can be inlined
: C ... B ... B ... ; \ A, B inlined, C can be inlined
By default CODE definitions are not marked for inlining
because the assembler cannot detect all cases which may upset
the return stack. If you want to make a code definition
available for binary inlining, follow it with the word
INLINE.
CODE <name>
...
END-CODE InLine
Some instructions are only available on later CPUs. Note that CPU selection affects the assembler and the VFX code generator at compile time, not the run time instruction usage of your application. If you select a higher CPU level than the one the application runs on, incorrect operation will occur. The default selection is for the Pentium 4 instruction set.
CPU=386 \ -- ; select base instruction set
CPU=PPro \ -- ; Pentium Pro and above with CMOVcc
CPU=P4 \ -- ; Pentium 4 and above
Aspects of the VFX code generator are controllable by
switches. In particular the inlining of the DO ... LOOP entry
code and local variable entry code may be turned on and off to
suit your particular coding style.
Note also that for large computationally intensive definitions,
the SMALLER and FASTER pair of switches may actually give
better performance using SMALLER. The impact of these switches
varies considerably between CPU types and cache/memory
architecture.
#16 value /code-alignment \ -- n
The default code alignment used by FASTER below.
Must be a power of two.
: smaller \ --
Selects smaller code using the minimum of alignment.
: faster \ --
Selects faster code using 16 byte alignment, which will
increase the size of the dictionary headers.
: +polite \ -- ; suppresses some warnings
Suppresses some warning messages which some users may feel
are commenting on their code. In particular, if you define
constants to enable and disable code without using conditional
compilation, you can use +POLITE to disable the warnings
about conditional branches against a constant. See also
-POLITE.
: -polite \ -- ; enables some warnings
Enables some warning messages which warn you if you have used a
phrase such as "<literal> IF". See +POLITE.
0 value MustLoad? \ -- n
Returns true if indirect accesses are loaded rather than delayed.
: +MustLoad \ --
Forces indirect memory loads to be fetched into a register rather
than delayed. For some applications (mostly calculations with
array indexing) this can lead to a performance gain.
: -MustLoad \ --
Permits indirect memory loads to be delayed. This is the default
condition.
: +short-branches \ --
Enables the VFX optimiser to produce short forward branches. If
your code causes a branch limit to be exceeded, you can put
-SHORT-BRANCHES and +SHORT-BRANCHES around the offending
words. By default, short branch generation is off because
leaving it off gives better performance on modern CPUs.
: -short-branches \ --
Prevents the VFX optimiser producing short forward branches.
By default, short branch generation is off.
: short-branches? \ -- flag ; true for short branches
Returns true if the optimiser will produce short forward branches.
: [-short-branches \ -- sys
Disables short branch optimisation until the previous state is
restored by SHORT-BRANCHES].
: [+short-branches \ -- sys
Enables short branch optimisation until the previous state is
restored by SHORT-BRANCHES].
: short-branches] \ sys --
Restores the short branch optimisation state previously saved
by [+SHORT-BRANCHES or [-SHORT-BRANCHES.
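A sketch of bracketing a single definition, using only the words documented above. CLAMP0 is a hypothetical example word, and the sketch assumes the bracket words are used outside a definition.

```forth
[+short-branches               \ save the current state, enable short branches
: clamp0  ( n -- n|0 )         \ hypothetical: replace negative values by zero
  dup 0< if drop 0 then ;
short-branches]                \ restore whatever state was in force before
```

Because the previous state is restored rather than forced, the same bracket works whichever setting the surrounding source selected.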
: LoopAlignment \ n --
Set loop starts, e.g. BEGIN..XXX and DO..LOOP, to be aligned
on an n-byte boundary, where n must be a power of two. This
is useful to force the heads of loops onto a cache line
boundary. The default is 8.
#16 LoopAlignment \ set to 16 byte boundary
0 LoopAlignment \ revert to lowest setting
: +fastlvs \ --
Enables generation of inline local variable entry code. This is
the default condition, and is strongly recommended.
: -fastlvs \ --
Disables generation of inline local variable entry code.
Most modern x86 operating systems use task gates for interrupt handling, which permits some code generation to be better, especially for local variables.
SafeOS? value SafeOS? \ -- flag
Returns true if the operating system can be assumed to be
safe.
: +SafeOS \ --
Assume a safe modern operating system.
: -SafeOS \ --
Assume an old-fashioned or raw operating system.
These directives control the optimiser.
: optimising? \ -- flag
Returns true if the optimiser is enabled.
: optimised \ -- ; turn optimisation on
Enables the optimiser.
: unoptimised \ -- ; turn optimisation off
Disables the optimiser.
These directives are used to turn optimisation on and off around sections of code.
: [opt \ -- i*x
Save the current state of optimisation at the start of an
[OPT ... OPT] structure. You can make no assumptions
about what the data stack contains.
: [-opt \ -- i*x
Save the current state of optimisation at the start of an
[-OPT ... OPT] structure and turn optimisation off.
: opt] \ i*x --
Restore the state of optimisation at the end of an
[OPT ... OPT] structure to what it was at the start.
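As a sketch of how these directives might be used to compare generated code, the fragment below compiles one hypothetical word, PROBE, with the optimiser off and then disassembles it with DIS. It assumes the bracket words are used outside a definition.

```forth
[-opt                          \ save the current state, optimiser off
: probe  ( n -- n' )  1+ 2* ;  \ hypothetical word, compiled unoptimised
opt]                           \ restore the saved state
dis probe                      \ inspect the generated code
```

Compiling the same definition again outside the bracket and disassembling it gives a direct view of what the optimiser contributes.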
The following directives are IMMEDIATE words that you
can put inside your definitions to obtain an idea of how
code is being compiled. DIS <name> will disassemble a word.
: [] \ --
Lay a NOP instruction as a marker, without flushing the
optimiser.
: [o/f] \ --
Flush the optimiser state, generating the canonical stack
state again with TOS in the EBX register, and all other
stack items in the deep (memory) stack.
: [o/s] \ --
Show the state of the optimiser's working stack.
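A minimal sketch of the three directives inside a definition. DEMO is a hypothetical word, and the exact output of [O/S] is not shown because it is not described here.

```forth
: demo  ( a b -- a+b )         \ hypothetical word used only for inspection
  over +  [o/s]                \ report the working stack at this point
  swap drop  [o/f]             \ flush back to the canonical stack state
  [] ;                         \ NOP marker, easy to spot in the listing
dis demo                       \ disassemble to see the effect
```

The directives act while DEMO is being compiled, so the report from [O/S] appears during compilation and not when DEMO runs.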
On i32/x86 Pentium-class CPUs the PUSH and POP instructions
generated by >R and R> are slow, and the VFX code
generator is quite conservative in optimising return stack
manipulations as compared with data stack manipulations.
Although the code below is convenient, safe and easy to write
it is slow. The rect.xxx words are fields in a structure.
: Rect@ \ rect -- l t r b
\ Retrieve the values l t r b from the RECT[ structure at
\ the address given.
>r
r@ rect.Left @
r@ rect.Top @
r@ rect.right @
r> rect.bottom @
;
The version below generates far better code when performance is important.
: Rect@ \ rect -- l t r b
\ Retrieve the values l t r b from the RECT[ structure at
\ the address given.
dup rect.Left @ swap
dup rect.Top @ swap
dup rect.right @ swap
rect.bottom @
;
Because of the limited number of registers, better code
is usually generated by passing a pointer to a structure
such as a rectangle rather than passing four items on the
data stack. Use of words such as Rect@ should be
reserved for preparing parameters for a Windows API call.
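As an illustration of the pointer-passing style, the sketch below computes a width and a height directly from a rectangle's address instead of unpacking four items first. The field words are stand-ins defined here only to make the sketch self-contained (four consecutive cells in the order left, top, right, bottom is an assumption), and RECTWIDTH, RECTHEIGHT and R1 are hypothetical names.

```forth
\ Stand-in field words; real code would use the structure's own fields.
: rect.Left    ( rect -- addr )  ;
: rect.Top     ( rect -- addr )  cell+ ;
: rect.right   ( rect -- addr )  2 cells + ;
: rect.bottom  ( rect -- addr )  3 cells + ;

: RectWidth   ( rect -- w )  dup rect.right @   swap rect.Left @ - ;
: RectHeight  ( rect -- h )  dup rect.bottom @  swap rect.Top @  - ;

create r1  10 , 20 , 110 , 70 ,   \ hypothetical rectangle: l t r b
r1 RectWidth .  r1 RectHeight .   \ displays 100 50
```

Each word keeps at most two items on the data stack at any time, which suits the limited register set better than carrying all four coordinates.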
If you have written custom optimisers, the EAX register is
no longer free for use, but must be requested like any other
working register. CODE definitions require no changes.
From VFX Forth v4.3, build 2825, the tokeniser replaces the previous source inliner. The change was made to improve ANS and Forth200x standards compliance, and to reduce issues with particularly "guru" code. To prevent breaking your existing code, the tokeniser uses the same word names for its control words.
The tokeniser keeps track of what is compiled for a word, and reruns the compilation of short definitions rather than copying the compiled code inline. This gives the VFX code generator many more opportunities to remove stack operations and produces smaller and faster code while encouraging users to write short definitions. That having been said, the relationship of code size with and without the tokeniser enabled is obscure at best.
Under some rare conditions, usually those requiring tinkering with internal structures of VFX Forth during compilation, it is necessary to have a level of control over the tokeniser. This section documents those words.
: discard-sinline \ --
Stops the current definition from being handled by the
tokeniser. This is usually required by a compilation
word which generates inline data, and for which repetition
of the word containing the inline data would generate large
code with little speed advantage.
#128 Value SinThreshold \ -- u
If the binary size of a word is less than this value,
it can be tokenised. Subject to change.
: .Tokens \ xt --
Display the token stream for a word.
: .Tokeniser \ --
Display tokeniser state.
FALSE Value Sin? \ -- flag
A VALUE which enables tokenising when set. Using SIN?
enables you to determine the state of the tokeniser.
false value sindoes? \ -- flag
A VALUE which enables tokenising of DOES> clauses when set.
Using this value enables you to determine the state of the
tokeniser.
false value SinActive? \ -- flag
Returns true when the tokeniser is active. It is used
to inhibit some immediate words which must not be rerun when
the word they are in is tokenised.
: +sin \ --
Enable tokenising of following definitions.
: -sin \ --
Disable tokenising of following definitions.
: +sindoes \ --
Enable tokenising of the run time portions of defining words.
Many defining words produced with CREATE ... DOES> have short
run time actions. The address returned by DOES> is a literal
and provides many opportunities for both space and speed
optimisation.
: -sindoes \ --
Disable tokenising of the run time portions of defining words.
: [sin \ -- i*x
[SIN and SIN] define a range of source code and
must be used interpretively, not during compilation.
[SIN saves the current tokeniser state and SIN]
restores it. Often used in the form:
[SIN -SIN ... SIN]
: sin] \ i*x --
See [SIN above.
: [-sin \ -- i*x
[-SIN saves the current tokeniser state, and turns
off the tokeniser. SIN] restores the saved tokeniser
state. Used in the form:
[-SIN ... SIN]
: [+sin \ -- i*x
[+SIN saves the current tokeniser state, and turns
on the tokeniser. SIN] restores the saved tokeniser
state. Used in the form:
[+SIN ... SIN]
: Sinlined? \ xt -- flag
Return true if the word defined by xt can be compiled by
the tokeniser.
: RemoveSin \ xt --
Remove tokeniser information from a word. If the
word has no tokeniser information it is unaffected.
: DoNotSin \ --
If the last word with a dictionary header must not be
tokenised, place DoNotSin after its definition, e.g.
: foo ... ; DoNotSin
: IMMEDIATE \ --
Mark the last defined word as immediate. Immediate words will
execute whenever encountered regardless of STATE.
IMMEDIATE also disables tokenising of the last
defined word. In practice, this is not a performance issue
as IMMEDIATE words are executed at compile time.
: RemoveSINinRange \ start end --
Remove all tokeniser information for definitions within the
given range.
: RemoveAllSins \ --
Remove all tokeniser data in the system.
RemoveAllSins is executed by the exit chain during BYE.
These gotchas are very rare conditions. They usually only
appear when you write words that affect the semantics
(meaning) of compilation. You can use [-sin ... sin]
to drill down to the words that are causing problems.
[-sin
: foo ... ;
: poo ... ;
sin]
The tokeniser hooks into the guts of COMPILE, and
LITERAL. Compilation performed through these words
is unaffected by the tokeniser.
Tokenising of IMMEDIATE words is disabled to reduce
problems with "guru" code. In nearly all cases, these words
are only executed at compile time, so there is minimal
impact on application performance. If an immediate word
causes compilation using COMPILE, and LITERAL,
the tokeniser will detect this and generate tokens, e.g.
: z1 postpone dup postpone over ; immediate
: z2 z1 ;
' z2 .tokens
StartToken
DUP
OVER
End Token
In the majority of cases the tokeniser handles defining
words quite adequately. In a few cases, such as defining
new types of xVALUE, better code generation can be
obtained by performing some calculation at compile time.
Such defining words should set a compiler for their children.
To do this, use SET-COMPILER and INTERP> rather
than DOES>. INTERP> indicates to the compiler
that what follows is performed when the child is
interpreted and that a compiler for the child has been
defined. The following example is the kernel definition
of VALUE.
: value \ n -- ; ??? -- ???
create
, ['] valComp, set-compiler
interp>
valInterp
;
Note that the children of words using INTERP> are
not immediate - they have separate interpretation and
compilation actions. SET-COMPILER ( xt -- ) above sets
valComp, to be the compiler of the last word CREATEd.
SET-COMPILER takes the xt of the word it is to
compile so that information can be extracted from the word.
There are rare occasions on which you may want to add a
compiler to a non-defining word. Rather than making the
word immediate and state-smart, which can lead to problems,
you can add the compiler yourself. This is especially
desirable when the compiler uses carnal knowledge of VFX Forth
rather than just COMPILE, and LITERAL. The example
is taken from the VFX Forth kernel.
: DO \ Run: n1|u1 n2|u2 -- ; R: -- loop-sys
NoInterp ;
comp: drop s_do, 3 ;
In nearly all cases, words that modify the return stack will be detected and these words will not be tokenised. However, in some cases words containing such words should not be tokenised because the flow of control has been modified. The first example below fails, but the second does not. Note that, according to the ANS and Forth200x standards, these words are non-standard because they make the assumption that, on entry to a word, the top item on return stack is the return address. The example below is taken from a third-party application ported to VFX Forth.
This example is correctly detected, but fails because the
code also requires the word containing LIST> not
to be tokenised.
: list> ( thread -- element )
BEGIN @ dup WHILE dup r@ execute REPEAT
drop r> drop ;
...
: .fonts fonts LIST> .font ;
The example above makes two assumptions, one about the
return stack in the use of R@ and R>, and
another about how colon definitions begin in EXECUTE.
The solution is to disable the tokeniser when the word is compiled. The containing word is forced to be untokenised.
: (list>) ( thread -- element )
BEGIN @ dup WHILE dup r@ execute REPEAT
drop r> drop ;
: list> ( thread -- element )
postpone (list>) discard-sinline ; immediate
...
: .fonts fonts LIST> .font ;
If you need to write words such as these, partitioning them
as above, plus careful use of :NONAME to create the
second part improves portability and maintainability.
If you build a new compiling word that uses colon, :,
its children can themselves be tokenised. If your new word
saves and restores data from the return stack indirectly,
the tokeniser may not detect this, leading to obscure
runtime or compilation errors. This situation can be avoided
by adding DISCARD-SINLINE after the use of colon, e.g.
: MY: \ --
: postpone save-state discard-sinline
;
: MY; \ --
postpone restore-state postpone ;
;
Some coding styles can lead to excessive expansion of code
size by the tokeniser. Apart from turning the tokeniser off,
you can try reducing the size set in the value
SinThreshold. Note that the relationship between
the compiled size of a word and its equivalent after
token expansion in another word is often obscure.
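A sketch of lowering the limit for one section of source and restoring it afterwards. It assumes SinThreshold can be changed with TO, as for any VALUE, and that the limit is applied when a word is defined. The figure of 64 and the word HELPER are purely illustrative.

```forth
SinThreshold            \ -- old ; keep the current limit on the data stack
#64 to SinThreshold     \ assumed smaller limit, chosen only for illustration
: helper  ( n -- n' )  1+ ;   \ hypothetical definition compiled under the limit
to SinThreshold         \ restore the saved limit
```

Saving and restoring the old value avoids depending on the default, which is documented as subject to change.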
From VFX Forth v4.3 onwards, code/data separation is turned on by default.
CPUs from the Pentium 3 onwards have serious performance problems when data is close to code, leading to a wide variation in performance depending on data location. Measurements on the random number generator in the benchmark suite had a variation of 7:1.
The file Sources\Kernel\386Com\OPTIMISE\P4opt.fth (with Professional and Mission versions) contains code for data space management for these processors. Results show that performance is improved by a factor of 2.3 on BENCHMRK.FTH and that performance is now independent of location. There is no degradation of performance on other CPUs. The code generation switches are:
+IDATA \ -- ; enable code/data separation
-IDATA \ -- ; disable code/data separation
Note that when enabled, phrases such as
VARIABLE <name> <size> ALLOT
will not give the expected result. This is discussed in more detail below.
The solution is to separate code and data. When the
optimisation is enabled, data is held in IDATA chunks
away from code. There is no change to CREATE,
ALLOT, comma and friends, which still operate on
normal dictionary areas. The notation is derived from cross
compiler usage in embedded systems.
The following is a conventional definition of a character/byte array defined in the dictionary.
: cCARRAY \ n -- ; i -- c-addr
CREATE ALLOT DOES> + ;
The data space reserved by ALLOT is intermingled
with code, leading to bad performance. The second
implementation is for best performance with P4 CPUs.
IRESERVE ( n -- c-addr ) reserves an n-byte block
in the IDATA area and returns its address. The
children of ICARRAY are made immediate in order
to emulate the effect of the source inliner on
children of CCARRAY. The implementation below is
illustrative only. State-smart words (considered "evil" by
some) can be avoided using set-compiler and interp>.
: icarray \ n -- ; i -- c-addr
dup ireserve dup rot erase \ reserve IDATA space
create immediate \ children are IMMEDIATE
, \ address in IDATA
does>
@ state @ if \ compiling
postpone literal postpone +
else \ interpreting
+
endif
;
In order to make the array defining word CARRAY
independent of whether P4 optimisation is enabled,
CARRAY simply selects which version to use.
: CARRAY \ n -- ; i -- c-addr
idata?
if icarray else ccarray endif
;
When +IDATA is in use, standard defining words such
as VARIABLE and VALUE will reserve space in the
IDATA areas, but ALLOT still reserves space in
the dictionary. Consequently code such as:
VARIABLE <name> <size> ALLOT
will break when +IDATA is active. Use:
<size> BUFFER: <name>
for all such allocations.
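A concrete instance of the replacement, as a sketch. LINE-BUF and the size of 256 bytes are hypothetical.

```forth
\ Breaks under +IDATA: the variable's cell and the ALLOTted bytes
\ end up in different places.
\   variable line-buf  #252 allot

#256 buffer: line-buf       \ hypothetical 256 byte buffer, one contiguous block
line-buf #256 erase         \ clear it before use
```

Because BUFFER: reserves the whole region itself, the code behaves the same whether IDATA is on or off.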
Words such as >BODY and BODY> will not work
correctly on words whose data area is in an IDATA region.
variable iblock \ -- addr
Holds the address of the current IDATA block.
variable iblock# \ -- addr
Holds the size of the current IDATA block.
variable idp \ -- addr
Holds the current location in the current IDATA block.
variable def-igap \ -- addr
Holds the minimum code/data gap size, by default 8 kbytes.
variable def-iblock# \ -- addr
Holds the default IDATA block size, by default 64 kbytes.
: bin-align \ n --
Force alignment to an N byte boundary where N is a
power of two. The space stepped over is set to 0.
: alignidef \ --
Align the dictionary to the IDATA default boundary.
: inoroom? \ n -- flag
Returns true if there is not enough room in the current
IDATA block.
: make-iblock \ n --
Make an IDATA block that is at least n bytes long. If n
is less than the default size in DEF-IBLOCK# the block
will be the default size.
: ialign \ --
Step the IDATA block pointer to the next 4 byte boundary
: ialign16 \ --
Step the IDATA block pointer to the next 16 byte boundary
: ireserve \ n -- a-addr
Reserve n bytes in the current IDATA block.
0 value idata? \ -- flag
Returns true if data is reserved in the IDATA block.
: +idata \ --
Force data to be reserved in IDATA blocks.
: -idata \ --
Data is reserved conventionally in the normal dictionary
space.
: 2variable \ -- ; -- addr
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
: variable \ -- ; -- addr
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
: buffer: \ size -- ; -- addr
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
: value \ n -- ; -- n
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
: 2value \ n -- ; -- n
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
: CARRAY \ n -- ; i -- c-addr
Creates a byte array. When the child executes, the
address of the i'th byte in the array is returned.
The index is zero based.
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
10 CARRAY MYCARRAY \ create 10 byte array
5 MYCARRAY . \ display address of element 5
: ARRAY \ n -- ; i -- a-addr
Creates a cell size array. When the child executes, the
address of the i'th cell in the array is returned.
The index is zero based.
If IDATA? is true, data is reserved in an IDATA block,
otherwise it is reserved in the dictionary.
10 ARRAY MYARRAY \ create 10 cell array
6 MYARRAY . \ display address of element 6