Building Forth StructuresJack Brien |
Anton Ertl has made many contributions to Forth - GForth, the RAFTS native compiler technology and an object-oriented extension for ANS Forths which makes use of some words for building data structures. In a collaboration with Anton, Jack Brien has written about the data structures with an interesting example and some variations of his own.
struct
or
Pascal's RECORD
but it is simple to provide one if
needed, and a good example of how Forth can be extended to provide
facilities that are not standard. There are many approaches,
depending on how much complexity your application needs. A very basic
one, which I have used a lot, is:
: field \ n ++ n ; declare n as an offset into a record dup constant ;used as in
0 field A 5 chars + \ size of field A field B 3 cells + \ size of field B field C 2 cells + \ size of field C
The words created by FIELD store their own offset, to which the same of the field is then added to calculate the next offset, leaving the total size of the record on the stack at the end. This can be stored in a constant or used to create a named record directly:
constant #FOO #foo create FOO allot
The address of (for example) field B in FOO can be found with:
foo b +
.
Since we always use the offsets with +
, using a defining
word field
that includes the +
in the
action of the defined word offers itself. (Notice how I have factored
out the action of FIELD into a separate word. This is not strictly
necessary, but can be useful, as we will see later):
: DOFIELD \ the action of a field DOES> \ addr1 -- addr2 ; calculate address of field @ + ; : FIELD \ # n ++ #' \ define a field with offset # and size n create over , + \ store offset, add the size to find new offset dofield ; \ set the action of the newly created field 0 5 chars field A 3 cells field B 2 cells field C create FOO allot
Instead of b +
, we now simply write b
so
the address of field B in FOO is now found with: foo b
.
Structures can be nested:
0 some fields #foo field D more fields create BAR allot
And bar d b
returns the address of field B in field D of BAR.
Some Forths have alignment requirements for CELLS FLOATS and DFLOATS such as requiring CELL-based data to begin at even-numbered addresses. In this case FOO above is incorrect, since the cell data at offset B may be at an odd address. One way round this is to start the structure with the data that has the greatest alignment requirement, and ensure that the whole structure is aligned to suit it.
The Standard words that do this are: ALIGN FALIGN
DFALIGN
and SFALIGN
which align the data space
pointer correctly for cells, floats, double floats and single floats
respectively. The related words ALIGNED FALIGNED DFALIGNED
SFALIGNED
each take a address and return the next address
aligned to the required standard. You could use these to ensure your
structures are correct by (for example) using aligned
before the declaration of field B, but this could get tedious when
dealing with complex structures.
Anton Ertl has devised this solution: each type descriptor includes its alignment requirement as well as its size.
1 chars 0 2constant STRUCT \ used to start a structure 1 aligned 1 cells 2constant CELL% \ The % indicates an alignment:size 1 chars 1 chars 2constant CHAR% \ pair passed on the Data Stack. 1 faligned 1 floats 2constant FLOAT% 1 dfaligned 1 dfloats 2constant DFLOAT% 1 sfaligned 1 sfloats 2constant SFLOAT% cell% 2* 2constant DOUBLE%
Use this technique to make descriptors for arrays. For a field containing a string of 10 chars, you might use:
char% 10 * 2constant STRING%
: NALIGN \ addr n -- addr' ; align address to standard n 1- tuck + swap invert and ; : ENDSTRUCT \ align size ++ ; declare a type descriptor for the structure over nalign \ pad size to full alignment 2constant ; : FIELD, \ align1 offset1 align size -- align2 offset2 ; swap rot over nalign dup , \ compile aligned offset rot + \ add size to make new offset >r nalign r> ; \ correct alignment requirement : FIELD \ align1 offset1 align size ++ align2 offset2 ; create field, dofield ;
FIELD has been changed to use the 2 values left by the type
descriptor. STRUCT is used (instead of the zero we were using so
far) to begin the declaration of a structure. Where we wrote
constant #foo
to save the size of the structure, we now
use endstruct foo%
. FOO% can then be used in a nested
structure. It is even possible to have nested arrays of structures, e.g.
foo% 20 * field 20FOO-ARRAY
Anton also supplies foo% %alloc
and foo%
%allot
as convenient ways to
create a dataspace to hold structure FOO%.
: %ALLOT \ align size -- addr ; reserve dataspace for structure here rot nalign \ get aligned address dup here - rot + allot ; : %ALLOC \ align size -- addr ; reserve heapspace nip allocate throw ;
Note that in ANS Forth the body of a create
d word is
aligned
but not necessarily faligned
;
therefore, if foo%
requires greater alignment
create foo foo% allot drop
may not produce the desired
effect. You need to do
foo% %allot constant foo
. The address returned by
ALLOCATE is already aligned to the highest standard, so the same
precaution is not needed there.
The field names that come to (my) mind are often quite generic and, if used, would cause frequent name clashes. E.g. many structures probably contain a
counter
field. The structure names that come to (my) mind are often also the logical choice for the names of words that create such a structure.Therefore, I have adopted the following naming conventions:
- The names of fields are of the form
struct-field
, wherestruct
is the basic name of the structure, andfield
is the basic name of the field. You can think about field words as converting the (address of the) structure into the (address of the) field.- The names of structures are of the form
struct%
, wherestruct
is the basic name of the structure.
The first field is at the base address of a structure and the word for this field actually does not change the address on the stack. You may be tempted to leave it out in the interest of run-time and space efficiency. However, it is possible to reduce the temptation:
: DONOTHING \ makes the last created word do nothing immediate \ do it at compile-time, so no code is generated DOES> drop ; \ drop the address that create leaves
Anton makes field
perform donothing
(and
therefore compile no code) if the offset on the stack is zero. A
different approach is to use another word for the first field:
: FIRST-FIELD \ align size ++ align size ; declare the first field create donothing ;
STRUCT is now redundant, so we can rename ENDSTRUCT as STRUCT. e.g.
Create a cell-sized structure to hold a pointer:
cell% struct POINTER%
Use it to create a structure that holds two pointers:
pointer% first-field &BRANCH-LEFT pointer% field &BRANCH-RIGHT struct BRANCH%
&name is simply my convention for naming a pointer.
BRANCH%es can be used to construct binary trees, but they are not much use without data. This is a characteristic of all generic structures such as lists, trees, stacks and queues. Sorted binary trees for example, can be built easily with any structure whose first field is a BRANCH%:
branch% cell% field CELL-BRANCH-DATA struct CELL-BRANCH% branch% foo% field FOO-BRANCH-DATA struct FOO-BRANCH%
In each case the field
has the same function - to
return the base address of the structure that the branch actually
contains. CELL-BRANCH-DATA and FOO-BRANCH-DATA are identical apart
from the name. You could equally well define the generic:
pointer% first-field &BRANCH-LEFT pointer% field &BRANCH-RIGHT cell 0 field BRANCH-DATA \ just a placemarker struct BRANCH%
Now, since BRANCH-DATA returns the offset as the following data field, we do not actually want to name the data field again. Instead we want a syntax like:
branch% foo% +struct FOO-BRANCH%
\ to build a tree of foo%s
Which is easily defined:
: +STRUCT \ generic% data% ++ ; rot + \ add the data sizes nip \ use the generic alignment struct ;
So you see that the placemarker in BRANCH% or any similar definition must have an alignment requirement suitable for whatever data type will follow. So to allow branches containing dfloats, the definition would be:
1 dfaligned 0 field BRANCH-DATA
The root of a tree is a pointer to a branch.
&left-branch
and &right-branch
point to other branches, or else contain zero.
COMPARE-DATA \ branch-data1 branch-data2 -- f ; true if data1 < data2 \ definition depends on the structure at branch-data : ?L/R \ branch f -- branch' ; IF &left-branch ELSE &right-branch THEN ; : TREE-INSERT \ branch tree -- ; \ insert branch at correct point in tree (ignoring duplicates) over branch-data swap \ branch1 branch-data1 &branch2 BEGIN dup @ WHILE \ not on leaf of tree @ 2dup branch-data compare-data ?l/r REPEAT nip ! ;
This definition would work for any type of tree, were it not for the different versions of COMPARE-DATA required. COMPARE-DATA is what is an object-orientated language calls a method, and object-oriented code is basically a way of making each structure ensure that the correct method is called. That is more complicated than what we have seen so far, but not by very much.
The original simple version can be found in Forthwrite Issue 64.
Chris Jakeman presented a comprehensive set of Array and Record words in Forthwrite Issue 55, developed from code in Dick Pountain's book Object-Oriented Forth (1987), which can also be borrowed from the FIG UK Library
Anton's Object Package
can be downloaded from
http://www.complang.tuwien.ac.at/forth/objects.zip.
This
contains a complete object-oriented system based on the principles given here.
I am indebted to Anton for his advice in writing this article.