by Lee (150 points)
Please see: https://gist.github.com/lf94/2700524fa42b76e86b047f35febd6370?permalink_comment_id=5076321#gistcomment-5076321

VFX Forth takes about 1.5s to run 1 billion transitions. The OCaml version takes 1.0s. We also ran a Rust version, which came out at 0.5s.

Obviously, these are amazing times. Rust uses LLVM, a multi-billion dollar optimizer. Being 1s away from this is crazy!

I've been trying, but I'm having difficulty squeezing any further speed out of VFX Forth. I would LOVE for some internal people to chime in here and tell me how I can go faster! Thank you so much!

1 Answer

by Stephen Pelc (4.1k points)
selected by Lee
Best answer
Life would be a lot easier with some comments and stack effects. See STATE-INIT in particular, which is surely incorrect.

On x86 and x64 moving the stack pointer is expensive, so try to reduce the use of >R and R> in particular. In fact, try just using the data stack.
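
To illustrate the point about >R and R>, here is a minimal sketch of the same small word written both ways. ADD-BOTH-R and ADD-BOTH-D are hypothetical names made up for this example and do not come from the linked gist; whether the data-stack version is actually faster has to be measured.

```forth
\ Hypothetical example words, not taken from the gist.

\ Return-stack version: parks n with >R, reads it with R@, drops it with R>.
: ADD-BOTH-R ( a b n -- a+n b+n )
  >R  SWAP R@ +  SWAP R> + ;

\ Data-stack-only version: same result without touching the return stack.
: ADD-BOTH-D ( a b n -- a+n b+n )
  TUCK +       \ a n b+n
  ROT ROT      \ b+n a n
  +  SWAP ;    \ a+n b+n
```

Both words leave the same two results; only the place where n is kept in the meantime differs.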

I 4 MOD just generates 0, 1, 2, 3, 0, 1... so just call TRANSITION 4 times in the loop and reduce the number of iterations by a factor of 4.
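
A rough sketch of that change, under stated assumptions: the real TRANSITION lives in the linked gist, so the definition below is only a stand-in with an assumed stack effect, and RUN-MOD and RUN-LIT are hypothetical names. The sketch also assumes the iteration count is a multiple of 4.

```forth
\ Stand-in for the real TRANSITION; the stack effect is an assumption.
: TRANSITION ( event state -- state' )  + 3 AND ;

\ Before: the event is computed with I 4 MOD on every pass.
: RUN-MOD ( state n -- state' )
  0 ?DO  I 4 MOD SWAP TRANSITION  LOOP ;

\ After: four calls per pass with literal events 0..3,
\ and a quarter of the passes. Assumes n is a multiple of 4.
: RUN-LIT ( state n -- state' )
  4 / 0 ?DO
    0 SWAP TRANSITION
    1 SWAP TRANSITION
    2 SWAP TRANSITION
    3 SWAP TRANSITION
  LOOP ;
```

Because I starts at 0, the events seen by TRANSITION are in the same order in both words, so the final state should match; the second form simply removes the MOD from the inner loop.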

Don't forget to measure the impact of every change - often writing code for performance is non-obvious.
by Lee (150 points)
Hello Stephen! A pleasure to see a response from you.

Sorry yes, comments and stack effects are something I'm not doing a lot of yet in Forth. I'm trying to write Forth that is just easy to read naturally. I'm currently reading your book (I would love to buy a physical copy by the way...) and will take notes from there on commenting style :)

Roger that about the return stack. I will move to using the data stack and see if it helps.

Huh, I would not have expected unrolling the calls to TRANSITION to help that much, but I will try this too!

Will report back later.
by Stephen Pelc (4.1k points)
I'm completely anal about comments and stack effects. I have to maintain code that is up to 40 years old. People who do not provide comments and stack effects are not hired by MPE.

You do not need to unroll calls to TRANSITION - the VFX code generator may or may not do that for you automagically. What we are doing is replacing
  I 4 MOD
by a literal.
by Lee (150 points)
I have made the changes. We got a nice speed increase, but we are still slower than OCaml. You can see the changes at the same link.

hyperfine 'dune exec traffic_light --release'
Benchmark 1: dune exec traffic_light --release
  Time (mean ± σ):     625.0 ms ±   8.9 ms    [User: 608.5 ms, System: 8.4 ms]
  Range (min … max):   614.4 ms … 644.1 ms    10 runs

hyperfine 'cat traffic_light2.fs | /usr/local/bin/VfxForth_x64_lin.elf'
Benchmark 1: cat traffic_light2.fs | /usr/local/bin/VfxForth_x64_lin.elf
  Time (mean ± σ):     895.3 ms ±   2.5 ms    [User: 885.0 ms, System: 2.2 ms]
  Range (min … max):   892.2 ms … 898.9 ms    10 runs

hyperfine 'cargo run --release'
Benchmark 1: cargo run --release
  Time (mean ± σ):      21.7 ms ±   7.9 ms    [User: 15.6 ms, System: 6.0 ms]
  Range (min … max):    16.3 ms …  69.3 ms    43 runs

Is there anything else you think we could do?