In the main, the VFX code generator is a black box. There are, however, a few control switches which are useful in limited cases. Because the VFX code generator is efficient, it is very rarely necessary to use assembler code. The main exceptions are very CPU-specific routines such as cache control.
As of July 2018, local variables for Cortex-M0 are fully operational.
: +short-branches cc/i \ --
You save memory by compiling branches with an 8 bit offset.
This is the default condition.
: -short-branches cc/i \ --
You may need this for big conditional structures.
: [-short-branches cc/i \ -- x
You may need this for big conditional structures.
[-short-branches ... short-branches]
: [+short-branches cc/i \ -- x
You can save memory by compiling branches with an 8 bit offset.
[+short-branches ... short-branches]
: [short-branches cc/i \ -- x
Save the current short/long branch state.
[short-branches ... short-branches]
: short-branches] cc/i \ x --
Restore the saved short/long branch state.
: [short-branches? cc/i \ -- x
Return true if short/long branches are enabled.
[short-branches ... short-branches]
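The bracketing forms above can be sketched as follows. This is only an illustration: `big-choice`, `long-path-a` and `long-path-b` are hypothetical names standing for a definition whose conditional bodies are too large for an 8 bit branch offset.

```forth
\ Sketch only: long-path-a and long-path-b are hypothetical
\ words assumed to be defined elsewhere.
[-short-branches              \ save current state, select long branches
: big-choice  \ flag --
  if  long-path-a  else  long-path-b  then
;
short-branches]               \ restore the state saved above
```

Using the bracketed pair rather than a bare -short-branches means the surrounding code keeps whatever setting was in force before.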
: +use-litpool \ -- ; enable use of literal pool
Enables use of the literal pool, which results in shorter code
at the expense of more data cache activity. This is usually a
better option than generating four-instruction sequences.
+USE-LITPOOL is the default for the ARM instruction set
because of the significant reduction in code size.
+USE-LITPOOL is not permitted for Cortex-M3 onwards,
but is required for Cortex-M0/M1.
: -use-litpool \ -- ; disable use of literal pool
Disables use of the literal pool. Immediate values that cannot
be represented by an ARM 12 bit literal may take up to four
instructions. +USE-LITPOOL is the default for the ARM,
Cortex-M0 and Cortex-M1 instruction sets.
: T2-mode cc/i \ -- ; select 32 bit mode
Selects generation of code using the full Thumb-2
instruction set and a register allocation model optimised
for Thumb-2. This is the default.
: Cortex-M0 cc/i \ --
Select Cortex-M0 code generation.
: Cortex-M1 cc/i \ --
Select Cortex-M1 code generation.
: Cortex-M3 cc/i \ --
Select Cortex-M3 code generation.
: Cortex-M4 cc/i \ --
Select Cortex-M4 code generation with the integer DSP
extension.
: Cortex-M4F cc/i \ --
Select Cortex-M4F code generation with the integer DSP
and single precision VFP extensions.
: Cortex-M7 cc/i \ --
Select Cortex-M7 code generation with the integer DSP
and single and double precision VFP extensions.
: ARM-mode cc/i \ -- ; select 32 bit mode
Selects generation of code using the ARM 32 bit instruction
set and a register allocation model optimised for Thumb-2.
This switch is for code generation with interworking between
ARM32 and Thumb code.
: Legacy-mode cc/i \ -- ; select 32 bit mode
Selects generation of code using the ARM 32 bit instruction set
and the register allocation model used by previous ARM compilers
that do not support Thumb-2. Selecting this mode provides
compatibility with previous ARM-only source code that may
include assembler.
: ARM7TDMI [+interpreter] Legacy-mode [previous] ;
A synonym for Legacy-mode.
: 32bit-mode [+interpreter] Legacy-mode [previous] ;
A synonym for Legacy-mode.
: +ARM32-Cortex-A/R cc/i ( -- ) <ARMunpredictable> on ;
ARM32 code that must run on Cortex-A and Cortex-R cores must
allow for increased restrictions in the register usage,
especially for the LDRx and STRx instructions.
: -ARM32-Cortex-A/R cc/i ( -- ) <ARMunpredictable> off ;
Remove the restrictions introduced by +ARM32-Cortex-A/R.
To support standalone Forths in which RAM or auxiliary Flash
may be more than 16/32 MB away from the primary Flash area,
optimisation of >BODY and TO-DO can be controlled by the
+LongCalls and -LongCalls directives below. These are
necessary for processors such as the Philips LPC families,
in which the Flash and RAM locations in memory are fixed.
Note that +LongCalls is only required for an interactive
Forth or when Forth calls to code outside the normal 16/32 MB
branch range are required.
: +LongCalls cc/i \ --
Permit long call sequences.
: -LongCalls cc/i \ --
Forbid long call sequences (default).
: LongCalls? cc/i \ -- flag
Long call sequences permitted?
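A minimal sketch of how these directives might bracket a definition that calls code outside the normal branch range. The names `ram-hook` and `far-routine` are hypothetical; `far-routine` stands for a word located beyond the direct branch range.

```forth
\ Sketch only: far-routine is a hypothetical word assumed to
\ live outside the normal branch range.
+LongCalls                    \ permit long call sequences
: ram-hook  \ --
  far-routine
;
-LongCalls                    \ back to the documented default
```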
: inlining cc/i \ u -- ; base version with protection
Specifies that words of less than u bytes can be binary
inlined. Use zero to disable binary inlining. Note that
inlining interacts with the +LeafCalls directive.
The use of INLINING is no longer recommended.
Cortex and ARM CPUs save subroutine return addresses in the LINK
register. If a word calls another word, the LINK register
must be preserved, usually on the return stack. Faster and
shorter code can be generated by only saving the LINK register
as required. We refer to this as the LeafCall optimisation.
All branches ensure that the LINK register is saved in case a
later CALL/BL is made. If you know that a call will not be made,
use ISLEAF before : <name> to inhibit the save of the LINK
register, e.g.
IsLeaf : uDelay \ n -- ; short delay
begin 1- dup 0= until
drop
;
The default is +LeafCalls.
: +LeafCalls cc/i \ -- ; enable late LINK saving
When used, the LINK register is only saved if it has to be.
Default.
: -LeafCalls cc/i \ -- ; disable late LINK saving
Forces colon definitions to save the LINK register on entry.
: LeafCalls? cc/i \ -- flag ; test for late LINK saving
Returns non-zero if late LINK saves are enabled.
: IsLeaf cc/i \ -- ; next word is a LEAF
Marks the next colon definition as being a leaf node, i.e. it
makes no subroutine calls.
: LoopAlignment cc/i \ n --
Set loop starts, e.g. BEGIN..XXX
and DO..LOOP
to be aligned on an n-byte boundary, where n must be a
power of two. This is useful to force the heads of loops
onto a cache line or memory buffer boundary. The default is 4.
For a Philips LPC CPU DO..LOOP performance can be
improved 10-20% in some cases by using n=16.
#16 LoopAlignment \ set to 16 byte boundary
0 LoopAlignment \ revert to lowest setting
: LoopAlignment? cc/i \ -- n
Return the current loop alignment setting.
: LoopAlign cc/i \ --
Align with NOPs to head of loop. Used to align loop heads
in assembler code.
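As an illustration of LoopAlignment, the sketch below aligns the loop head of one performance-critical word and then returns to the documented default of 4. The word `sum-bytes` is a hypothetical example, not part of the compiler.

```forth
#16 LoopAlignment             \ align loop heads to 16 bytes

\ Hypothetical example: sum len bytes starting at addr.
: sum-bytes  \ addr len -- sum
  0 rot rot                   \ -- 0 addr len
  over + swap                 \ -- 0 addr+len addr
  ?do  i c@ +  loop           \ accumulate each byte into the sum
;

4 LoopAlignment               \ back to the documented default
```

Restricting the larger alignment to the few words that need it limits the padding added to the image.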
: +FastLVs cc/i \ -- ; enable fast local variables
Enables generation of inline local variable entry code. This
is the default condition, and is strongly recommended.
: -FastLVs cc/i \ -- ; disable fast local variables
Disables generation of inline local variable entry code.
: +FixedLVs cc/i \ -- ; enable faster local variable entry code
Enables generation of faster and shorter local variable entry
code. The only penalty is that run-time adjustment of the stack
frame is much more difficult and may require more stack space.
This is the default condition, and is strongly recommended.
: -FixedLvs cc/i \ -- ; disable faster local variable entry code
Disables generation of faster/shorter inline local variable
entry code.
: SmallFrames cc/i \ -- ; max 512 bytes of LVs
Cortex only: A local variable frame is restricted
to a bit less than 512 bytes. This is enough for
embedded systems use and results in smaller/faster entry code.
Requires both +FixedLVs and +FastLVs.
: LargeFrames cc/i \ -- ; max 4096 bytes of LVs
Cortex only: A local variable frame is restricted
to a bit less than 4096 bytes. This is enough for
all hosted use.
Requires both +FixedLVs and +FastLVs.
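A sketch of selecting the small frame model before a word that uses local variables. The word `scale` is hypothetical, and the `{ ... -- }` locals notation is assumed to be the one accepted by your compiler; adjust it if your system uses a different locals syntax.

```forth
+FastLVs +FixedLVs            \ both are required by SmallFrames
SmallFrames                   \ Cortex only

\ Hypothetical example: n' = n * mul / div
: scale  { n mul div -- n' }
  n mul div */
;
```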
Intrinsics are what compiler people call special code
generators that provide access to CPU-specific operations.
The indicator C_ in the names below is just for
internal use by the compiler. The words you use are without C_.
The intrinsics in this section are available for all CPUs.
c_arshift arshift \ x u -- x'
Arithmetic right shift.
c_ror ror \ x u -- x'
Rotate right.
fetchop c_w@s \ addr -- sw
16 bit fetch and then sign extended.
fetchop c_c@s \ addr -- sb
8 bit fetch and then sign extended.
addstoreop c_w+! \ n addr --
As +! but for a 16 bit item.
addstoreop c_c+! \ n addr --
As +! but for an 8 bit item.
addstoreop c_or! \ mask addr --
OR the mask into the data at addr.
addstoreop c_and! \ mask addr --
AND the mask into the data at addr.
addstoreop c_xor! \ mask addr --
XOR the mask into the data at addr.
addstoreop c_bic! \ mask addr --
Clear the mask bits in the data at addr.
addstoreop c_bor! \ bmask caddr --
OR the bmask into the 8 bit data at caddr.
addstoreop c_band! \ bmask caddr --
AND the bmask into the 8 bit data at caddr.
addstoreop c_bxor! \ bmask caddr --
XOR the bmask into the 8 bit data at caddr.
addstoreop c_bbic! \ bmask caddr --
Clear the bmask bits in the 8 bit data at caddr.
rUP c_reg@ c_up@ \ -- addr
Return the address of the USER area.
rLP c_reg@ c_lp@ \ -- addr
Return the local variable pointer.
rUP c_reg! c_up! \ addr --
Set the USER area pointer.
rLP c_reg! c_lp! \ addr --
Set the local variable pointer.
: c_di \ -- ; disable interrupts
For Cortex parts, lays CPS .ID .I. For ARM parts, lays
a call to the word DI.
: c_ei \ -- ; enable interrupts
For Cortex parts, lays CPS .IE .I. For ARM parts, lays
a call to the word EI.
: c_dfi \ -- ; disable fast interrupts
For Cortex parts, lays CPS .ID .F. For ARM parts, lays
a call to the word DFI.
: c_efi \ -- ; enable fast interrupts
For Cortex parts, lays CPS .IE .F. For ARM parts, lays
a call to the word EFI.
: c_fsp! \ -- ; set R8
Set the floating point stack pointer R8, e.g.
<address> FSP!
: c_fsp@ \ -- ; get R8
Return the floating point stack pointer R8 in a new TOS.
: c_sys@ \ -- ; <lit> SYS@
Intrinsic for MRS reg, SYSm. Use this to read a system
register such as CONTROL or BASEPRI.
BASEPRI sys@
: c_sys! \ -- ; <val> <lit> SYS!
Intrinsic for MSR SYSm, reg. Use this to write a system
register.
2 CONTROL sys!
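The two system register intrinsics can be combined. The sketch below, using a hypothetical word `with-basepri`, reads the current BASEPRI value and replaces it in one step, so that the caller can restore it later. It assumes a core that implements BASEPRI (Cortex-M3 and later; Cortex-M0 does not have this register).

```forth
\ Hypothetical helper: set BASEPRI to prio, return the old value.
: with-basepri  \ prio -- old
  BASEPRI sys@                \ -- prio old
  swap BASEPRI sys!           \ -- old
;

\ Restore later with:  <old> BASEPRI sys!
```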
: c_set-sp \ -- ; <lit> SET-SP
Used to set the Forth data stack pointer during initialisation.
Unlike the normal Forth SP!, the previous top of stack
is not saved. After use of SET-SP, you can assume
nothing about the data stack except that it is empty.
: c_ByteRevL \ --
Swap the bytes in a 32 bit long word.
: c_ByteRevW \ --
Swap the bytes in the low 16 bits of TOS and swap the bytes
in the upper 16 bits. This is safe to use to get a zero
extended result immediately after W@.
: c_ByteRevWS \ --
Swap the bytes in the low 16 bits of TOS and sign extend
them to 32 bits.
: c_ByteRevWZ \ --
Swap the bytes in the low 16 bits of TOS and zero extend it.
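A typical use of the byte reversal intrinsics is reading big-endian protocol fields on a little-endian target. The words below are hypothetical helpers; they assume a little-endian configuration, suitably aligned addresses, and that the intrinsics are used without the C_ prefix as described above.

```forth
\ Hypothetical helpers for big-endian data on a little-endian CPU.
: be-w@   \ addr -- u ; 16 bit fetch, zero extended
  w@ ByteRevWZ
;
: be-w@s  \ addr -- n ; 16 bit fetch, sign extended
  w@ ByteRevWS
;
: be-@    \ addr -- x ; 32 bit fetch
  @ ByteRevL
;
```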
: c_wfe \ --
Flushes the stack and generates a WFE .N instruction.
: c_wfi \ --
Flushes the stack and generates a WFI .N instruction.
: c_dsb#0F \ --
Flushes the stack and generates a DSB # $0F instruction.
: c_isb#0F \ --
Flushes the stack and generates an ISB # $0F instruction.
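As a small sketch, WFI might be used in a background loop so that the CPU sleeps until the next interrupt. The word `idle` is hypothetical, and the sketch assumes all real work is done in interrupt handlers.

```forth
\ Hypothetical idle loop: sleep until an interrupt, then repeat.
: idle  \ --
  begin  wfi  again
;
```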
The ARM Cortex VFX optimiser includes a number of special cases which may not be present in the target code, but can be easily written if required. In the main, these are present to support peripheral register handling.
BOR! BAND! BXOR! BBIC! ( mask addr -- )
Logical operation on bytes (8 bits) in memory for CPUs with
8 bit peripheral registers.
OR! AND! XOR! BIC! ( mask addr -- )
Logical operation on words (32 bits) in memory for CPUs with
32 bit peripheral registers (the vast majority).
ARSHIFT ROL ROR ( x count -- x' )
Shift operations in the style of LSHIFT. ARSHIFT
performs an arithmetic right shift. ROL and ROR
are 32 bit circular left and right shifts.
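Where a target lacks these words, high-level equivalents are straightforward. The definitions below are a sketch of two of them, not the compiler's own code; they have the same stack effects but do not benefit from the optimiser's special cases.

```forth
\ Sketch of high-level equivalents for targets without them.
: or!   \ mask addr -- ; OR the mask into the cell at addr
  dup >r @ or r> !
;
: bic!  \ mask addr -- ; clear the mask bits in the cell at addr
  dup >r @ swap invert and r> !
;
```

The other operations follow the same pattern, with C@ and C! replacing @ and ! for the byte versions.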
Writing good code for ARM and Cortex CPUs can double performance by reducing instruction count and memory accesses. This is particularly true for I/O operations involving bit masking. Register operations are nearly always single-cycle, but on most ARM/Cortex CPU implementations memory loads (LDRx instructions) take three cycles on ARM and two cycles on Cortex, taken branches need three cycles, and memory stores (STRx instructions) take two cycles. All of these are lengthened by any additional memory overhead, such as running with wait states and/or 16 bit memory. A small amount of instruction cache can make a huge difference to systems with slow memory.
ARM CPUs are not good at handling literals as these are assembled 8 bits at a time. Thus an I/O address may result in four 32 bit instructions. To reduce this, the code generator places literals that cannot be coded in one instruction in a literal pool after the end of the word.
Cortex-M3/4 CPUs have better literal handling plus additional instructions for loading 16-bit immediates. A 32 bit literal can be loaded into a register in two instructions (8 bytes). For Cortex-M3/4, the code density benefit of a literal pool is minimal, but the performance cost is high.
The following code fragments illustrate the benefits of efficient coding. In MPE practice, absolute peripheral addresses are indicated by a leading underscore character '_', whereas offsets from a base address do not have an underscore. These examples are for an NXP LPC2106 CPU with an Ethernet controller connected to GPIO lines.
The first example is the simplistic operation.
: eth>read \ -- ; turns data bus for reading
_IODIR @
$1FFE00FF and \ set data bus to i/p
$E00000F0 or \ set others to o/p
_IODIR !
;
dis eth>read
ETH>READ
ldr r0, [ pc, # $1C ] ( @$B3CC = $E0028008 )
ldr r8, [ r0, # $00 ]
ldr r0, [ pc, # $18 ] ( @$B3D0 = $1FFE00FF )
and r8, r8, r0
ldr r0, [ pc, # $14 ] ( @$B3D4 = $E00000F0 )
orr r8, r8, r0
ldr r1, [ pc, # $10 ] ( @$B3D8 = $E0028008 )
str r8, [ r1, # $00 ]
mov pc, r14
36 bytes, 9 instructions.
To this we have to add four 32 bit literals in the literal pool for a total of 12 words or 48 bytes (excluding the final return). At one cycle for the eight instructions we have to add two cycles each for the three loads (+6) and one cycle for the write (+1). The total is 48 bytes and 8+6+1=15 cycles.
The second example is written knowing that we have an ARM CPU and with understanding of the GPIO structure.
: erdmode \ -- ; turns data bus for reading
_GPIO
dup IODIR + @
edbmask invert and
swap IODIR + !
;
dis erdmode
ERDMODE
ldr r8, [ pc, # $0C ] ( @$B770 = $E0028000 )
ldr r7, [ r8, # $08 ]
bic r7, r7, # $FF00
str r7, [ r8, # $08 ]
mov pc, r14
20 bytes, 5 instructions.
Again ignoring the final return we have four effective instructions, one literal pool fetch and one store, resulting in 20 bytes (vs 48) and 4+2+1=7 cycles (vs 15). Removing the redundant OR has saved us four cycles and twelve bytes, and loading the base address (literal) once has saved us another four cycles and eight bytes.
Most peripheral operations require access to more than one
register, e.g. command/status/data. Loading the base address once
and then accessing individual registers by fixed offsets from the
base allows the VFX code generator to optimise more away at the
expense of slightly longer source code. Use of the built-in
optimiser macros can reduce the expansion of the source code.
Because the PC relative loads have a limited range, the code
generator builds a new literal pool for each word. Most source
files concentrate on a limited set of peripheral register base
addresses. You can avoid the size penalty of the literal pool by
providing a local pointer to the base address in the CDATA section
and referencing it throughout the source file. This will be
accessed by a PC relative load.
L: ^GPIOe
_GPIO ,
: erdmode2 \ -- ; turns data bus for reading
^GPIOe @
dup IODIR + @
edbmask invert and
swap IODIR + !
;
dis erdmode2
ERDMODE2
ldr r8, [ pc, # $-1C ] ( @$B9F4 = $E0028000 )
ldr r7, [ r8, # $08 ]
bic r7, r7, # $FF00
str r7, [ r8, # $08 ]
mov pc, r14
20 bytes, 5 instructions.
Thus the rule of thumb is to use the Forth stack to DUP or OVER
literals rather than to reference them several times as
literals, and to avoid use of literals that do not
fit in the ARM 4/8 bit literal format. By doing this you will
also make better use of future improvements to the VFX code
generator.
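The rule of thumb can be sketched with a word that touches two registers of the same peripheral. Everything here other than _GPIO is an assumption for illustration: `eth-strobe` and `STROBE-BIT` are hypothetical, and `IOSET` and `IOCLR` are assumed to be defined as offsets from the GPIO base in the same way as IODIR.

```forth
\ Hypothetical: pulse one control line high then low.
\ The base address is fetched once and reused from the stack.
: eth-strobe  \ --
  _GPIO                       \ -- base
  STROBE-BIT over IOSET + !   \ set the line ; -- base
  STROBE-BIT swap IOCLR + !   \ clear the line ; --
;
```

Compared with writing the two absolute register addresses as separate literals, only one base address literal is needed.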