Life would be a lot easier with some comments and stack effects. See STATE-INIT in particular, which is surely incorrect.
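As a rough illustration of the annotation style meant here, this is a small sketch. SQUARE-SUM is a made-up word, not part of the code under review: the stack effect sits in parentheses after the name, and a trailing comment on each line records what is left on the stack.

```forth
\ Hypothetical word, for illustration only.
: SQUARE-SUM ( a b -- a*a+b*b )
  DUP *        \ a b*b
  SWAP DUP *   \ b*b a*a
  + ;          \ a*a+b*b
```

With comments like these, a word whose body doesn't match its declared stack effect tends to stand out on a read-through.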
On x86 and x64, moving the stack pointer is expensive, so try to cut down on >R and R> in particular. In fact, try using just the data stack.
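A minimal sketch of the kind of rewrite meant here, using two made-up words with the same stack effect. Neither comes from the code under review, and whether the second is actually faster depends on your Forth system.

```forth
\ Hypothetical words, for illustration only.

\ Return-stack version: park b on the return stack while a is doubled.
: DOUBLE-FIRST-R ( a b -- 2a b )
  >R 2* R> ;

\ Data-stack version with the same stack effect.
: DOUBLE-FIRST ( a b -- 2a b )
  SWAP 2* SWAP ;
```

Both leave identical results, so one can be swapped for the other and timed.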
I 4 MOD just generates 0, 1, 2, 3, 0, 1... so just call TRANSITION 4 times in the loop and reduce the number of iterations by a factor of 4.
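Roughly, the unrolling could look like the sketch below. It assumes TRANSITION has the stack effect ( phase -- ) and that the iteration count is a multiple of 4. Both are guesses about the original code, and RUN-MOD and RUN-UNROLLED are made-up names.

```forth
\ Sketch only. The stand-in TRANSITION exists so this loads on its own;
\ leave it out when trying this against the real code.
: TRANSITION ( phase -- ) DROP ;

\ Assumed original shape: one call per iteration, phase from I 4 MOD.
: RUN-MOD ( n -- )
  0 ?DO I 4 MOD TRANSITION LOOP ;

\ Unrolled: four calls per iteration, a quarter of the iterations, no MOD.
\ Assumes n is a multiple of 4.
: RUN-UNROLLED ( n -- )
  4 / 0 ?DO
    0 TRANSITION 1 TRANSITION 2 TRANSITION 3 TRANSITION
  LOOP ;
```

If the count isn't a multiple of 4, the leftover iterations need handling separately after the loop.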
Don't forget to measure the impact of every change - what actually makes code faster is often non-obvious.
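One way to take such measurements, assuming Gforth: its utime word returns the current time in microseconds as a double-cell number. Other systems have different timing words. TIME-IT and BENCH are made-up names, and the word being timed is assumed to have the stack effect ( -- ).

```forth
\ Gforth-specific sketch: utime ( -- d ) is a Gforth word, not standard Forth.
: TIME-IT ( xt -- )
  utime 2>R        \ save the start time on the return stack
  EXECUTE          \ run the word under test, assumed ( -- )
  utime 2R> D-     \ elapsed microseconds as a double
  D. ." us" CR ;

\ Hypothetical workload, just to show the call.
: BENCH ( -- ) 1000000 0 DO LOOP ;

' BENCH TIME-IT
```

Run each variant several times and compare, since a single timing can be noisy.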